1 Introduction



No-regret strategies are simple adaptive learning rules that have recently received a lot of
attention in the literature.^1 In a repeated game, a player has a regret for an action if,
loosely speaking, she could have obtained a greater average payoff had she played that
action more often in the past. In the course of the game, the player reinforces actions
that she regrets not having played enough, for instance, by choosing the next action with
probability proportional to the regret for that action, as in Hart and Mas-Colell's [26]
regret matching rule. Existence of no-regret strategies (i.e., strategies that guarantee no
regrets almost surely in the long run) has been known since Hannan [25]; wide classes of
no-regret strategies are identified by Hart and Mas-Colell [27] and Cesa-Bianchi and Lugosi [13].^2
A no-regret dynamics is a stochastic process that describes trajectories of the average
correlated play of players and that emerges when every player follows a no-regret strategy
(different players may play different strategies). By definition, it converges to the Hannan
set (the set of all correlated actions that satisfy the no-regret condition first stated by
Hannan [25]).^3 This set is typically large. It contains the set of correlated equilibria of the
game and we show that it may even contain correlated actions that put positive weight
only on strictly dominated actions. Thus convergence of the average play to the Hannan
set often provides very little information about what the players will actually play, as it
does not even imply exclusion of strictly dominated actions.

In this paper we show that no-regret dynamics are intimately linked to the classical
fictitious play process [11]. Drawing on Monderer et al. [42], we first show that, contrary
to the standard discrete-time version, continuous fictitious play leads to no regret. We
then show that, for a large class of no-regret dynamics, if a player's maximal regret is
$\varepsilon > 0$, then she plays an $\varepsilon$-best reply to the average correlated play of the others. Since in
this class the maximal regret vanishes (see Corollary 1 below), it follows that, for a good
choice of behavior when all regrets are negative, the dynamics is a vanishingly perturbed
version of fictitious play.



^1 These rules have been used to investigate convergence to equilibria in the context of learning in games
[...], for combining different forecasts [...] (for an overview of the forecast combination
literature see [16, 47]) and for combining opinions, which is also of interest to management science.
In finance this method has been used to derive bounds on the prices of financial instruments [...].
This method can be applied to various tasks in computer science, such as job scheduling [43] and routing
(for a survey of applicable problems in computer science see [35]).

^2 This paper deals with the simplest notion of regret, known as unconditional (or external) regret
[22, 27, 28]. For more sophisticated regret notions, see Hart and Mas-Colell [26], Lehrer [38], and
Cesa-Bianchi and Lugosi [...].

^3 The Hannan set of a game is also known as the set of weak correlated equilibria [43] or coarse correlated
equilibria [52, Ch. 3].
For two-player finite games, this observation and the theory of perturbed differential
inclusions [5] allow us to relate formally the asymptotic behavior of no-regret dynamics
and of continuous fictitious play (or its time-rescaled version, the best-reply dynamics
[24]). In classes of games in which the behavior of continuous fictitious play is well known,
this provides substantial information on the asymptotic behavior of no-regret dynamics.
In particular, we recover most known convergence properties of no-regret dynamics. Our
results do not just allow us to find new and sometimes much shorter proofs of convergence
of no-regret dynamics towards the set of Nash equilibria in some classes of games, such as
dominance solvable games or potential games. They also allow us to relate the asymptotic
behavior of no-regret dynamics and continuous fictitious play in case of divergence, as in
the famous Shapley game [45].

These results extend only partially to n-player games (though they fully extend to
n-player games with linear incentives [...]). The issue is that in n-player games no-regret
dynamics turn out to be related to the correlated version of continuous fictitious play, in
which the players play a best reply to the correlated past play of the others. This version
of fictitious play is defined through a correspondence which is not convex valued. This
creates technical difficulties, because the theory of perturbed differential inclusions is not
developed for nonconvex-valued correspondences.

A different way to analyze no-regret dynamics is to show that some sets attract nearby
solution trajectories. We show that strict Nash equilibria and, more generally, the
intersection of the Hannan set and the sets that are closed under rational behavior (curb)^4 are
attracting for no-regret dynamics, in a sense to be defined in Section 4.

The remainder of the note is organized as follows. The next section introduces
no-regret dynamics. Section 3 studies the links between no-regret dynamics and fictitious
play. Section 4 shows that the intersection of the Hannan set and curb sets is attracting
for no-regret dynamics. Section 5 studies the continuous-time version and the expected
version of no-regret dynamics. Finally, the Appendix contains the proofs of the main
results, as well as counterexamples illustrating the complexity of the relationship between
ICT and limit sets.



^4 A product set of action profiles is called closed under rational behavior (curb) [3] if it contains all
best replies of each player whenever she believes that no actions outside this set are being played by the
other players.
2 Preliminaries



Consider a bimatrix game $\Gamma = (A_i, u_i)_{i=1,2}$, where $A_i$ is the set of actions of player
$i$ and $u_i : A \to \mathbb{R}$ is her payoff function, with $A = A_1 \times A_2$. For any finite set
$B$, denote by $\Delta(B)$ the set of probability distributions over $B$. A mixed action of
player $i$ is an element of $\Delta(A_i)$. A correlated action $z$ is a probability distribution
over the set of pure action profiles, i.e., $z \in \Delta(A)$. Given such a $z$, let $z_i \in \Delta(A_i)$
and $z_{-i} \in \Delta(A_{-i})$ denote its marginals for player $i$ and her opponent, respectively.
Thus, $z_i(a_i) = \sum_{a_{-i} \in A_{-i}} z(a_i, a_{-i})$. Throughout, $-i$ refers to $i$'s opponent. As usual,
let $u_i(z) = \sum_{a \in A} z(a) u_i(a)$ and $u_i(k, z_{-i}) = \sum_{a_{-i} \in A_{-i}} z_{-i}(a_{-i}) u_i(k, a_{-i})$, where $k$,
depending on the context, may refer to a pure action (an element of $A_i$) or to a vertex
of $\Delta(A_i)$, i.e., a Dirac measure on a pure action.

The game is played repeatedly in discrete time periods $t \in \mathbb{N}^* = \{1, 2, \ldots\}$. In every
period $t$ each player $i$ chooses an action $a_i(t) \in A_i$ and receives payoff $u_i(a(t))$, where
$a(t) = (a_1(t), a_2(t))$. Denote by $h(t) = (a(1), a(2), \ldots, a(t))$ the history of play up to $t$,
and let $\mathcal{H}$ be the set of all finite histories (including the empty history). A strategy of
player $i$ is a function $q_i : \mathcal{H} \to \Delta(A_i)$ that stipulates to play in every period $t = 1, 2, \ldots$
a mixed action $q_i(t) = q_i(h(t-1))$ as a function of the history before $t$. The weight that
this mixed action puts on action $k \in A_i$ is denoted by $q_{i,k}(t)$.

The average correlated play up to period $t$ is $z(t) = \frac{1}{t} \sum_{\tau=1}^{t} a(\tau)$, where we identify
$a(\tau)$ with the corresponding vertex of $\Delta(A)$. Since $z(t) = \frac{1}{t} [a(t) + (t-1) z(t-1)]$, it
follows that for all $t > 1$,

$$z(t) - z(t-1) = \frac{1}{t} \left( a(t) - z(t-1) \right). \tag{1}$$

For a correlated action $z$, the regret of player $i$ for action $k$ is defined as $R_{i,k}(z) =
u_i(k, z_{-i}) - u_i(z)$, and her maximal regret as $R_{i,\max}(z) = \max_{k \in A_i} R_{i,k}(z)$. Typically we
deal with the regret based on the average correlated play, $z(t)$, up to some period $t$. In this
case the regret of player $i$ for action $k \in A_i$ is equal to the difference between the average
payoff she would have obtained by always playing $k$ (assuming that her opponent's play
remains the same) and her average realized payoff:

$$R_{i,k}(z(t)) = u_i(k, z_{-i}(t)) - u_i(z(t)) = \frac{1}{t} \sum_{\tau=1}^{t} \left[ u_i(k, a_{-i}(\tau)) - u_i(a(\tau)) \right].$$

To simplify notation, we will often write $R_{i,k}(t)$ for $R_{i,k}(z(t))$ and $R_{i,\max}(t)$ for $R_{i,\max}(z(t))$.
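As a rough illustration of these definitions, the following Python sketch builds the average correlated play $z(t)$ with the recursive update (1) and then evaluates the regret vector of the row player. The payoff matrix `U1`, the short history of play, and the helper names are hypothetical choices made for this example; they are not taken from the paper.

```python
import numpy as np

def average_play(history, n1, n2):
    """Average correlated play z(t), built with the recursive update (1)."""
    z = np.zeros((n1, n2))
    for t, (a1, a2) in enumerate(history, start=1):
        vertex = np.zeros((n1, n2))
        vertex[a1, a2] = 1.0        # a(t) identified with a vertex of Delta(A)
        z += (vertex - z) / t       # z(t) - z(t-1) = (a(t) - z(t-1)) / t
    return z

def regrets_player1(U1, z):
    """Regret vector (R_{1,k}(z))_k = u_1(k, z_2) - u_1(z) of the row player."""
    z2 = z.sum(axis=0)              # marginal of z on the opponent's actions
    return U1 @ z2 - np.sum(U1 * z)

# Hypothetical 2x2 payoffs of player 1 and a hypothetical history of play.
U1 = np.array([[1.0, 0.0],
               [0.0, 2.0]])
history = [(0, 1), (1, 0), (0, 1), (1, 1)]

z = average_play(history, 2, 2)
R1 = regrets_player1(U1, z)         # regrets of player 1 after four periods
```

In this example the second action carries a positive regret, so the maximal regret of player 1 is positive after four periods.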
Player $i$ has no asymptotic regret if her average realized payoff is asymptotically no
less than her best-reply payoff against the empirical distribution of her opponent:

$$\limsup_{t \to \infty} R_{i,\max}(t) \le 0. \tag{2}$$

A strategy of player $i$ is a no-regret strategy if for any strategy of the other player,
inequality (2) holds almost surely. This property is also called Hannan consistency [27] or
universal consistency [22].



It is well known in the literature since Hannan [25] that there exist simple no-regret
strategies. Hart and Mas-Colell [27] describe a wide class of potential-based no-regret
strategies. A twice differentiable, convex function $P_i : \mathbb{R}^{A_i} \to \mathbb{R}$ is called a potential if it
satisfies the following conditions:

(R1) $P_i(\cdot) \ge 0$, and $P_i(x) = 0$ for all $x \in \mathbb{R}^{A_i}_-$;

(R2) $\nabla P_i(\cdot) \ge 0$, and $\nabla P_i(x) \cdot x > 0$ for all $x \notin \mathbb{R}^{A_i}_-$;

(R3) if $x \notin \mathbb{R}^{A_i}_-$ and $x_k \le 0$, then $\nabla_k P_i(x) = 0$,

where $\nabla_k$ denotes the partial derivative with respect to $x_k$. The potential $P_i$ can be
viewed as a generalized distance function between a vector $x \in \mathbb{R}^{A_i}$ and the nonpositive
orthant $\mathbb{R}^{A_i}_-$. Let $R_i(t) = (R_{i,k}(t))_{k \in A_i}$ denote player $i$'s regret vector.

Proposition 1. Let $P_i$ satisfy (R1)-(R3) and let strategy $q_i$ satisfy

$$q_{i,k}(t+1) = \frac{\nabla_k P_i(R_i(t))}{\sum_{s \in A_i} \nabla_s P_i(R_i(t))}, \quad \forall k \in A_i, \tag{Q1}$$

whenever $R_{i,\max}(t) > 0$. Then $q_i$ is a no-regret strategy.

Proof. This holds by Theorem 3.3 of Hart and Mas-Colell [27], whose conditions (R1)
and (R2) are satisfied under our conditions (R1)-(R3) and (Q1), and whose
proof is based on Blackwell's Approachability Theorem [...]. ■

A standard example of a no-regret strategy satisfying the above conditions is obtained
by letting $P_i$ be the $l_p$-norm on $\mathbb{R}^{A_i}_+$, i.e., $P_i(x) = \left( \sum_{k \in A_i} [x_k]_+^p \right)^{1/p}$ with $1 < p < \infty$, where
$[x_k]_+ = \max(0, x_k)$. The resulting strategy is called the $l_p$-norm strategy [13, 27]. It is
defined by

$$q_{i,k}(t+1) = \frac{[R_{i,k}(t)]_+^{p-1}}{\sum_{s \in A_i} [R_{i,s}(t)]_+^{p-1}}, \quad \forall k \in A_i,$$

whenever $R_{i,\max}(t) > 0$. The $l_2$-norm strategy is the regret-matching strategy [26], which
stipulates to play an action in the next period with probability proportional to the regret
for that action. For large $p$, the $l_p$-norm strategies approximate fictitious play.
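The $l_p$-norm rule is easy to state in code. The sketch below is only an illustration under assumed inputs: the function name `lp_norm_strategy` and the regret vector are hypothetical. It returns the mixed action prescribed by the formula above when the maximal regret is positive, and shows regret matching ($p = 2$) next to a large value of $p$.

```python
import numpy as np

def lp_norm_strategy(regrets, p=2.0):
    """Mixed action of the l_p-norm strategy (requires 1 < p < infinity)."""
    if p <= 1.0:
        raise ValueError("the l_p-norm strategy is defined for p > 1")
    pos = np.maximum(np.asarray(regrets, dtype=float), 0.0)   # [R_k]_+
    if pos.max() <= 0.0:
        raise ValueError("the formula applies only when the maximal regret is positive")
    weights = pos ** (p - 1.0)          # proportional to the gradient of the l_p potential
    return weights / weights.sum()

regrets = [0.3, -0.2, 0.1]              # hypothetical regret vector R_i(t)
q_rm = lp_norm_strategy(regrets, p=2)   # regret matching: proportional to positive regrets
q_big = lp_norm_strategy(regrets, p=50) # mass concentrates on the maximal-regret action
```

With a large $p$ almost all probability goes to the action with maximal regret, which is a best reply to the opponent's empirical play; this is the sense in which these strategies approximate fictitious play.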

We say that the average correlated play $z(t)$ follows a no-regret dynamics if both
players use (possibly different) no-regret strategies. A trajectory $(z(t))_{1 \le t < +\infty}$ of a
no-regret dynamics is thus a solution of (1) where $a(t)$ is a realization of $(q_1(t), q_2(t))$ and
$q_1$, $q_2$ are no-regret strategies. We focus on the class $\mathcal{R}$ of no-regret dynamics such that:

(i) the no-regret strategies $q_1$, $q_2$ of the players are potential-based: they satisfy (Q1)
for some potentials $P_1$, $P_2$ satisfying (R1)-(R3);

(ii) if a player has no regret then he takes some constant pure action: for each $i = 1, 2$,
there exists $c \in A_i$ such that

$$a_i(t+1) = c \quad \text{whenever } R_{i,\max}(t) \le 0. \tag{Q2}$$

Our results are valid for a somewhat wider class of no-regret dynamics. What we really
need, besides a no-regret dynamics, is that from some period $t_0$ on:

(i') if a player has positive regret for some actions, then she plays one of these actions;

(ii') if a player never has any positive regret, then she plays an $\varepsilon(t)$-best reply to the
empirical distribution of her opponent, where $\varepsilon(t) = \varepsilon(h(t)) \to 0$ almost surely.



Remark 1. Property (i') follows from (R3) and (Q1). This is a better-reply property: it
assigns positive probability only to actions that are better replies to the opponent's
empirical distribution of play ("better" with respect to the realized payoff). It also implies
that if $R_{i,\max}(t) > 0$ in some period $t$, then $R_{i,\max}(t') > 0$ for all $t' > t$. Indeed, when
an action $k$ with positive regret is played, the sign of $R_{i,k}(t)$ does not change, hence the
maximal regret remains positive [27, Proposition 4.3].



Remark 2. Assumption (Q2) is a simple way of ensuring (ii'), and in addition, that
if $R_{i,\max}(t) \le 0$ for all $t$, then $R_{i,\max}(t) \to 0$ as $t \to +\infty$.^5 Indeed, if $R_{i,\max}(t) \le 0$
for all $t \ge t_0$ then by (Q2), for all $t \ge t_0$, $t R_{i,c}(t) = t_0 R_{i,c}(t_0)$, hence $R_{i,c}(t) \to 0$. It
follows that $R_{i,\max}(t) \to 0$ and that for all $t \ge t_0$, player $i$ plays an $\varepsilon(t)$-best reply with
$\varepsilon(t) := \max_{k \in A_i} u_i(k, z_{-i}(t)) - u_i(c, z_{-i}(t)) = R_{i,\max}(t) - R_{i,c}(t) \to 0$. For a discussion of
other possible assumptions, see Hart and Mas-Colell [..., Appendix A].



Note that there are no-regret dynamics that do not satisfy (i'), for instance, stochastic
fictitious play with a noise parameter that declines with time at an appropriate rate (see,
e.g., Benaïm and Faure [4]). This process is not potential based in our sense due to the
time inhomogeneity, but this is not the crucial point, since (i')-(ii') would suffice.

^5 This additional property is needed for Corollary 1 below, but for our main results (ii') suffices.

Define the Hannan set $H$ of the stage game $\Gamma$ as the set of all correlated actions of
the players where each player has no regret:

$$H = \left\{ z \in \Delta(A) : \max_{k \in A_i} u_i(k, z_{-i}) \le u_i(z) \text{ for each } i = 1, 2 \right\}.$$

The reduced Hannan set $H_R$ is the subset of $H$ in which at least one regret is exactly zero
for each player:

$$H_R = \left\{ z \in \Delta(A) : \max_{k \in A_i} u_i(k, z_{-i}) = u_i(z) \text{ for each } i = 1, 2 \right\}.$$
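The two definitions can be checked numerically for a given correlated action. The following sketch is an illustration only: the helper names, the tolerance `tol`, and the 2x2 coordination game used as an example are assumptions of this sketch, not objects from the paper.

```python
import numpy as np

def max_regrets(U1, U2, z):
    """Maximal regrets (R_{1,max}(z), R_{2,max}(z)) of a correlated action z."""
    z1, z2 = z.sum(axis=1), z.sum(axis=0)    # marginals of z
    r1 = np.max(U1 @ z2) - np.sum(U1 * z)    # max_k u_1(k, z_2) - u_1(z)
    r2 = np.max(z1 @ U2) - np.sum(U2 * z)    # max_k u_2(z_1, k) - u_2(z)
    return r1, r2

def in_hannan_set(U1, U2, z, tol=1e-12):
    """True if no player has a positive regret at z (up to tol)."""
    return all(r <= tol for r in max_regrets(U1, U2, z))

def in_reduced_hannan_set(U1, U2, z, tol=1e-12):
    """True if the maximal regret of each player is zero at z (up to tol)."""
    return all(abs(r) <= tol for r in max_regrets(U1, U2, z))

# Hypothetical coordination game: both players get 1 on the diagonal, 0 elsewhere.
U1 = U2 = np.eye(2)
z_diag = np.array([[0.5, 0.0], [0.0, 0.5]])   # perfectly coordinated play
z_prod = np.full((2, 2), 0.25)                # independent uniform play
z_off = np.array([[0.0, 0.5], [0.5, 0.0]])    # perfectly miscoordinated play
```

In this hypothetical game `z_diag` lies in the Hannan set with strictly negative regrets, `z_prod` lies in the reduced Hannan set, and `z_off` lies outside the Hannan set.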



The next property of no-regret dynamics is straightforward by the definition of
no-regret strategies and Remark 2 (see, e.g., Hart and Mas-Colell [..., Corollary 3.2]).



Corollary 1. For every no-regret dynamics in class $\mathcal{R}$, the trajectories converge almost
surely to the reduced Hannan set.

Convergence of the average play $z(t)$ to the set $H_R$ does not imply its convergence to any
particular point in $H_R$. Moreover, even if $z(t)$ converges to a point, this point need not
be a Nash equilibrium.



3 Fictitious play and no-regret dynamics 
3.1 Fictitious play 

In discrete fictitious play, in every period $t$ after the initial one, player $i$ plays a pure best
reply $a_i(t)$ to the average past play of her opponent $x_{-i}(t-1) := \frac{1}{t-1} \sum_{\tau=1}^{t-1} a_{-i}(\tau)$ (here
$a_{-i}(\tau)$ is a vertex of $\Delta(A_{-i})$). The latter is called the belief of player $i$ on her opponent's
next move. Formally, for any $x = (x_1, x_2)$ in $\Delta(A_1) \times \Delta(A_2)$, denote by $BR_i(x_{-i})$ player
$i$'s set of best replies to $x_{-i}$:

$$BR_i(x_{-i}) := \left\{ x_i \in \Delta(A_i) : u_i(x_i, x_{-i}) = \max_{k \in A_i} u_i(k, x_{-i}) \right\}, \quad i = 1, 2.$$

Let $BR(x) = BR_1(x_2) \times BR_2(x_1)$. A discrete-time trajectory $(x(t))_{t=1}^{\infty}$ on $\Delta(A_1) \times \Delta(A_2)$
is a solution of discrete fictitious play (DFP) if for every $t > 1$

$$x(t) - x(t-1) = \frac{1}{t} \left( a(t) - x(t-1) \right), \tag{3}$$

where $a(t) = (a_1(t), a_2(t))$ and $a_i(t) \in BR_i(x_{-i}(t-1))$ is a vertex of $\Delta(A_i)$ associated
with some pure best reply action, $i = 1, 2$.

Analogously, an absolutely continuous function $x : [1, \infty) \to \Delta(A_1) \times \Delta(A_2)$ is a
solution of continuous fictitious play (CFP) if for almost all $t \ge 1$, $x(t)$ is differentiable
and

$$\dot{x}(t) = \frac{1}{t} \left( q(t) - x(t) \right),$$

where $q(t) \in BR(x(t))$ is now a profile of mixed actions. This may be written as the
differential inclusion:

$$\dot{x}(t) \in \frac{1}{t} \left( BR(x(t)) - x(t) \right). \tag{4}$$

The average correlated play satisfies $z(t) := \frac{1}{t} \left( z(1) + \int_1^t \hat{q}(\tau) \, d\tau \right)$ for some initial condition
$z(1)$ such that $z_i(1) = x_i(1)$, $i = 1, 2$. Thus, for almost all $t$, $z(t)$ is differentiable and

$$\dot{z}(t) = \frac{1}{t} \left( \hat{q}(t) - z(t) \right), \tag{5}$$

where $\hat{q} = q_1 \otimes q_2 \in \Delta(A)$ is the product distribution corresponding to the mixed strategy
profile $q = (q_1, q_2) \in \Delta(A_1) \times \Delta(A_2)$, and $q_i$ is a best reply to $z_{-i}$.^6

In discrete or continuous fictitious play, the marginals $z_1(t)$, $z_2(t)$ of the average past
play are equal to the beliefs $x_1(t)$, $x_2(t)$. By analogy, if $z(t)$ is the average past play
generated by a no-regret dynamics, it is convenient to call $z_{-i}(t)$ the belief of player $i$
about her opponent's next move. This illuminates a crucial difference between fictitious
play and no-regret dynamics in class $\mathcal{R}$: under fictitious play, a player chooses a best reply
to her belief, whereas under no-regret dynamics, she chooses a better reply ("better" with
respect to her average realized payoff).



^6 This definition of CFP guarantees that solutions exist in all games and for all initial conditions
and that by the change of time scale $y(t) = x(e^t)$, CFP corresponds to the best-reply dynamics [24, 41]
defined by $\dot{y} \in BR(y) - y$. Another definition of CFP (e.g., Monderer et al. [42, p. 445] and Berger [8,
pp. 252-253]) considers only trajectories that are piecewise linear, such that $q_i(t)$ is always a pure action
(technically, a vertex of $\Delta(A_i)$), and that the times at which $q(t)$ changes have no finite accumulation
point. This restricted definition is easier to handle, but in many games there do not exist such trajectories
from every initial condition.
3.2 Continuous fictitious play leads to no regret

It is well known that discrete fictitious play does not lead to no regret [27, 51]. Consider
the following example:





         L          R
L     1, √2       0, 0
R     0, 0        √2, 1

Fig. 1



Because $\sqrt{2}$ is irrational, L and R cannot both be best replies to the empirical past play of
the other player. Thus, any DFP process is entirely determined by its first move. Assume
that the first move is off the diagonal, say (L, R). Due to the symmetry of the game and
the absence of ties, both players always switch to another action simultaneously. Therefore
the play is locked off the diagonal and the maximal regret is at least $\sqrt{2}/(1 + \sqrt{2})$ at any
stage. This holds in the mixed extension of the game, since at any stage the players have
a unique, pure best reply.
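The lock-in argument can be checked by simulation. The sketch below assumes the payoffs of Fig. 1 to be $(1, \sqrt{2})$ at (L, L), $(\sqrt{2}, 1)$ at (R, R) and $(0, 0)$ off the diagonal, which is a reading consistent with the symmetry and the regret bound stated above; the function name and the horizon of 2000 periods are arbitrary choices. It is a numerical illustration, not a proof.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)
# Assumed payoffs of Fig. 1, with actions indexed L = 0 and R = 1.
U1 = np.array([[1.0, 0.0], [0.0, SQRT2]])    # row player
U2 = np.array([[SQRT2, 0.0], [0.0, 1.0]])    # column player

def discrete_fictitious_play(first_move, periods):
    """Each player plays a pure best reply to the opponent's empirical past play."""
    history = [first_move]
    counts1 = np.zeros(2)                    # how often player 1 used each action
    counts2 = np.zeros(2)                    # how often player 2 used each action
    counts1[first_move[0]] += 1
    counts2[first_move[1]] += 1
    for _ in range(periods - 1):
        a1 = int(np.argmax(U1 @ counts2))    # best reply of player 1 to x_2(t-1)
        a2 = int(np.argmax(counts1 @ U2))    # best reply of player 2 to x_1(t-1)
        history.append((a1, a2))
        counts1[a1] += 1
        counts2[a2] += 1
    return history

history = discrete_fictitious_play(first_move=(0, 1), periods=2000)
off_diagonal = all(a1 != a2 for a1, a2 in history)

# Off the diagonal all realized payoffs are 0, so R_{1,max}(t) = max_k u_1(k, z_2(t)).
x2 = np.bincount([a2 for _, a2 in history], minlength=2) / len(history)
max_regret_1 = np.max(U1 @ x2)
bound = SQRT2 / (1.0 + SQRT2)
```

Using raw counts instead of frequencies does not change the best replies, since the two differ by a positive factor.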

Since the continuous fictitious play process is a continuous-time version of DFP,
intuitively, it should not lead to no regret either. The following result, a generalization of
Theorem D of Monderer et al. [42], shows that this intuition is misleading.



Proposition 2. Under any solution of continuous fictitious play, the average correlated 
play converges to the reduced Hannan set. 

This discrepancy between DFP and CFP may be explained as follows. Playing an 
action with positive regret decreases the regret for this action. In CFP, roughly, when an 
action is played it remains a best reply, hence it is associated with maximal regret for a 
small time increment. Precisely, the derivative of the regret for the action played is equal 
to the derivative of the maximal regret. Since the regret for this action decreases, so does 
the maximal regret. In contrast, in DFP, an action played at stage t has maximal regret 
at stage t, but not necessarily at stage t + 1. Thus the fact that the regret for this action 
decreases does not entail that the maximal regret does. 
Proof of Proposition 2. For comparison with Hart and Mas-Colell [..., Theorem 3.1],
rescale time (let $\tilde{t} = \exp t$) so that (5) becomes $\dot{z} = \hat{q} - z$, with $\hat{q} = q_1 \otimes q_2$ the product
distribution of the mixed actions played. For any mixed action $\sigma_i \in \Delta(A_i)$, let

$$R_{i,\sigma_i}(t) := \sum_{k \in A_i} \sigma_i(k) R_{i,k}(t) = u_i(\sigma_i, z_{-i}(t)) - u_i(z(t)).$$

Let $V_i(t) = R_{i,\max}(t)$. Note that $R_{i,k}$ is Lipschitz continuous for all $k$ in $A_i$. Thus it follows
from Theorem A.4 of Hofbauer and Sandholm [30] that, for almost all $t$, $V_i$ and $R_{i,k}$ are
differentiable, and for all $k$ such that $q_{i,k} > 0$, we have $\dot{V}_i(t) = \dot{R}_{i,k}(t)$. It follows that
$\dot{V}_i = \sum_k q_{i,k} \dot{R}_{i,k} = \dot{R}_{i,q_i}$. Furthermore:

$$\dot{R}_{i,q_i} = u_i(q_i, \dot{z}_{-i}) - u_i(\dot{z}) = u_i(q_i, q_{-i} - z_{-i}) - u_i(\hat{q} - z) = -[u_i(q_i, z_{-i}) - u_i(z)] = -R_{i,q_i}.$$

Since $q_i$ is a best reply to $z_{-i}$, $R_{i,q_i} = V_i$. Thus, $\dot{V}_i = -V_i$. Therefore, $V_i(t)$ converges
to zero for all $i = 1, 2$, hence $z(t) \to H_R$. ■

Remark 3. In the proof, we did not use that $q_{-i}$ is a best reply to $z_i$. This shows
that the fact that CFP leads to no regret is a unilateral property. That is, if a player's
behavior evolves according to CFP, then she has no asymptotic regret, independently of
her opponent's behavior (see also Monderer et al. [42, p. 445]).



Remark 4. CFP and the best-reply dynamics converge to the set of Nash equilibria in
finite zero-sum games [32]. The usual proof is to show that the "duality gap" $W(x) =
\max_{k \in A_1} u_1(k, x_2) - \min_{s \in A_2} u_1(x_1, s)$ converges to zero. This follows from the above proof,
since in a two-player zero-sum game $W(x(t)) = R_{1,\max}(z(t)) + R_{2,\max}(z(t))$, where $x$ is a
solution of CFP and $z$ the associated correlated play.
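The identity between the duality gap and the sum of maximal regrets only uses $u_2 = -u_1$ and the fact that $x$ is the pair of marginals of $z$, so it can be checked on arbitrary data. The sketch below does this for a randomly drawn zero-sum game and correlated action; the random payoffs, the seed and the function names are assumptions of this example, and the code says nothing about convergence.

```python
import numpy as np

rng = np.random.default_rng(1)

def duality_gap(U1, x1, x2):
    """W(x) = max_k u_1(k, x_2) - min_s u_1(x_1, s)."""
    return np.max(U1 @ x2) - np.min(x1 @ U1)

def sum_of_max_regrets(U1, z):
    """R_{1,max}(z) + R_{2,max}(z) in the zero-sum game with u_2 = -u_1."""
    U2 = -U1
    z1, z2 = z.sum(axis=1), z.sum(axis=0)
    r1 = np.max(U1 @ z2) - np.sum(U1 * z)
    r2 = np.max(z1 @ U2) - np.sum(U2 * z)
    return r1 + r2

U1 = rng.normal(size=(3, 3))                   # hypothetical zero-sum payoffs
z = rng.dirichlet(np.ones(9)).reshape(3, 3)    # arbitrary correlated action
x1, x2 = z.sum(axis=1), z.sum(axis=0)          # its marginals, playing the role of x

gap = duality_gap(U1, x1, x2)
regret_sum = sum_of_max_regrets(U1, z)
```

The realized payoffs $u_1(z)$ and $u_2(z)$ cancel in the sum, which is why the two quantities coincide.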

3.3 No-regret dynamics is perturbed CFP 

In the previous subsection we showed that CFP leads to no regret. Conversely, we now
show that any no-regret dynamics in class $\mathcal{R}$ (as defined in Section 2) is closely related
to CFP. We first explain the intuition. Denote by $BR_i^{\varepsilon}(x_{-i})$ the set of $\varepsilon$-best replies of
player $i$ to the mixed action $x_{-i}$ of her opponent:

$$BR_i^{\varepsilon}(x_{-i}) = \left\{ x_i \in \Delta(A_i) : u_i(x_i, x_{-i}) \ge \max_{k \in A_i} u_i(k, x_{-i}) - \varepsilon \right\}, \quad i = 1, 2.$$

The crucial observation is the following. 

Lemma 1. Assume that the maximal regret is less than $\varepsilon$. Then any action with positive
regret is an $\varepsilon$-best reply to the average play of the opponent.

Proof. If player $i$ has positive regret for action $a_i$ at some $z \in \Delta(A)$, then $u_i(z) -
u_i(a_i, z_{-i}) < 0$. But by assumption $\max_{k \in A_i} u_i(k, z_{-i}) - u_i(z) < \varepsilon$. Therefore, $\max_{k \in A_i}
u_i(k, z_{-i}) - u_i(a_i, z_{-i}) < \varepsilon$, and $a_i$ is an $\varepsilon$-best reply to $z_{-i}$. ■
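Lemma 1 can be sanity-checked numerically by taking $\varepsilon$ equal to the maximal regret itself: the payoff gap of any action $k$ with positive regret is $R_{i,\max} - R_{i,k}$, which is then below $R_{i,\max}$. The sketch below tests this on randomly generated payoffs and correlated actions; the sizes, the seed and the function name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def check_lemma1(U1, z, tol=1e-12):
    """Every action with positive regret is an eps-best reply, eps = maximal regret."""
    z2 = z.sum(axis=0)
    payoff_vs_belief = U1 @ z2                   # u_1(k, z_2) for each pure action k
    regrets = payoff_vs_belief - np.sum(U1 * z)  # R_{1,k}(z)
    eps = regrets.max()
    for k in np.flatnonzero(regrets > 0):
        # Payoff loss of action k relative to an exact best reply to z_2.
        if payoff_vs_belief.max() - payoff_vs_belief[k] > eps + tol:
            return False
    return True

results = []
for _ in range(1000):
    U1 = rng.normal(size=(3, 4))                 # hypothetical payoffs of player 1
    z = rng.dirichlet(np.ones(12)).reshape(3, 4) # random correlated action
    results.append(check_lemma1(U1, z))
```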

Since no-regret dynamics in class $\mathcal{R}$ only pick actions with positive regret, they only
pick $\varepsilon$-best replies to the average play of the others, where $\varepsilon$ is the maximal regret.
Since this maximal regret approaches zero almost surely, eventually only almost-exact
best replies are picked. This provides the intuition why no-regret dynamics and fictitious
play may exhibit similar asymptotic behavior. Finding a precise link, however, is not
obvious. For instance, there could exist actions that are $\varepsilon_t$-best replies in each period $t$,
with $\varepsilon_t \to 0$, but never exact best replies. Thus a limit play of no-regret dynamics may
include such actions, but this cannot happen under fictitious play.





         L          R
L     1, 0        0, √2
R     0, 1        √2, 0
C     η, 0        η, 0

Fig. 2
Consider the example shown in Fig. 2. Let $\eta = \sqrt{2}/(1 + \sqrt{2})$. It is easy to verify that
action C is player 1's best reply to player 2's mixed action $x_2$ if and only if $x_2 = (\eta, 1 - \eta)$.
Let us first consider DFP. Since $\eta$ is an irrational number, after every finite history of
play, $C \notin BR_1(x_2(t))$; consequently DFP never picks C (except, possibly, at the initial
period).^7 However, it may be shown that under any DFP trajectory, the average play
$x_2(t)$ of player 2 converges to $(\eta, 1 - \eta)$, to which C is a best reply. It follows that C is
an $\varepsilon_t$-best reply to $x_2(t)$ for some sequence $\varepsilon_t \to 0$. Thus a no-regret dynamics with the
same trajectory of the marginal play of player 2 might choose action C a positive fraction
of time in the long run.
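The role of the irrationality of $\eta$ can be illustrated numerically. The sketch below assumes that player 1's payoffs in Fig. 2 are 1 and 0 for L, 0 and $\sqrt{2}$ for R, and $\eta$ in both columns for C; the function name and the grid sizes are arbitrary. It measures how far C is from being a best reply against every rational belief $k/t$ for a few values of $t$.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)
ETA = SQRT2 / (1.0 + SQRT2)
# Assumed payoffs of player 1 in Fig. 2 (rows L, R, C; columns L, R).
U1 = np.array([[1.0, 0.0],
               [0.0, SQRT2],
               [ETA, ETA]])

def gap_of_C(p):
    """Payoff loss of C relative to a best reply when player 2 plays L with probability p."""
    payoffs = U1 @ np.array([p, 1.0 - p])
    return payoffs.max() - payoffs[2]

# After t periods the empirical frequency of L is a rational number k / t.
t_values = [10, 100, 1000, 10000]
min_gaps = [min(gap_of_C(k / t) for k in range(t + 1)) for t in t_values]
gap_at_eta = gap_of_C(ETA)
```

The smallest loss stays positive on every rational grid but shrinks as $t$ grows, which matches the statement that C is never an exact best reply along a DFP path yet is an $\varepsilon_t$-best reply with $\varepsilon_t \to 0$ once $x_2(t)$ approaches $(\eta, 1 - \eta)$.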

This example does not apply to CFP, as in this case $x_2(t)$ need not be a rational
number; and as we show below, the asymptotic behavior of no-regret dynamics and CFP
can be formally related using the theory of perturbed differential inclusions [5, 6].

Before stating a precise result, we need some definitions. A set $L \subset \Delta(A_1) \times \Delta(A_2)$ is
invariant under CFP if for every initial point $x \in L$ there exists a solution $x(\cdot)$ of CFP,
defined for all $t > 0$ (not only $t \ge 1$) and such that $x(1) = x$ and $x(t) \in L$ for all $t > 0$.
A nonempty compact invariant set is an attractor if it attracts uniformly all trajectories
starting in its neighborhood. An invariant set $L$ is attractor-free if no proper subset of $L$
is an attractor for the dynamics restricted to $L$. A nonempty compact set $L$ is internally
chain transitive (ICT) for continuous fictitious play if every pair of points in $L$ can be
connected by finitely many arbitrarily long pieces of orbits of CFP lying completely within
$L$ with arbitrarily small jumps between them.^8 Every ICT set is invariant and attractor
free [6, Property 2]. The limit set of the beliefs of a trajectory $z(t)$ on $\Delta(A_1 \times A_2)$ is the
set of all accumulation points of its marginals $(z_1(t), z_2(t)) \in \Delta(A_1) \times \Delta(A_2)$ as $t \to \infty$.

^7 Starting with an arbitrary belief $x_2(1)$ would not help since C is a best reply only when $x_2(t) =
(\eta, 1 - \eta)$, which happens at most once.

Theorem 1. For every no-regret dynamics in class $\mathcal{R}$, the limit set of the beliefs is almost
surely internally chain transitive for continuous fictitious play.^9

We give here a sketch of the proof. The details are given in Appendix A.1. A discrete-time
trajectory $(x_1(t), x_2(t))_{t=1}^{\infty}$ on $\Delta(A_1) \times \Delta(A_2)$ is a payoff-perturbed DFP trajectory
if there exists a positive sequence $(\varepsilon_t)$ converging to zero such that (3) holds and $a_i(t)$ is
a vertex of $\Delta(A_i)$ associated with a pure $\varepsilon_t$-best reply to $x_{-i}(t-1)$, for all $i = 1, 2$ and
all $t > 1$. A no-regret dynamics in class $\mathcal{R}$ generates a trajectory $(z(t))_{t=1}^{\infty}$ on $\Delta(A)$ and
an associated sequence of beliefs $(z_1(t), z_2(t))$ on $\Delta(A_1) \times \Delta(A_2)$. Building on Lemma 1,
we show that this sequence of beliefs is almost surely a payoff-perturbed DFP trajectory.
By an auxiliary lemma, this implies that this is almost surely a graph-perturbed DFP
trajectory: a notion similar to payoff-perturbed trajectory, but for another definition of
perturbed best reply, the one used in the theory of perturbed differential inclusions.
It follows that the continuous-time interpolation of this sequence of beliefs is almost surely
a perturbed solution of CFP, in the sense of Benaïm et al. [5]. Theorem 1 then follows
from Theorem 3.6 of Benaïm et al. [5].

Since ICT sets are invariant, a consequence of Theorem 1 is the following:

Corollary 2. Let $\mathcal{A}$ be the global attractor of CFP (i.e., its maximal invariant set, see
Benaïm et al. [...]). For any no-regret dynamics in class $\mathcal{R}$, the limit set of the beliefs is
almost surely a subset of $\mathcal{A}$.

Note the similarity with Propositions 5.1 and 5.2 of Hofbauer et al. [33], who study
the links between the time-average of the replicator dynamics and CFP.



3.4 Applications of Theorem 1 and comments

Theorem 1 allows for alternative and sometimes much shorter proofs of most known
convergence properties of no-regret dynamics. Below, we write that no-regret dynamics
converge to some set $E$ if the limit set of the beliefs is almost surely a subset of $E$.^10

^8 For the formal definitions of attractor and attractor-free set see Benaïm et al. [6, p. 675]; for the
definition of ICT see Benaïm et al. [5, p. 337]. Note that the definition of invariance in Benaïm et al.
[5, 6] applies to the best-reply dynamics, so an appropriate time rescaling must be used to apply it to
CFP (see footnote 6). This explains why their definition considers solutions defined for all $t \in \mathbb{R}$ while
ours considers solutions defined for all $t > 0$.

^9 In the statement of Theorem 1, CFP can be replaced by the best-reply dynamics since they clearly
have the same ICT sets (see footnote 6).







(a) For any game which is best-reply equivalent to a two-person zero-sum game, the global
attractor of CFP is the set of Nash equilibria [32]. Hence all no-regret dynamics in class
$\mathcal{R}$ converge to the set of Nash equilibria. Actually, in zero-sum games, if the correlated
action $z$ is in the Hannan set (recall that this is the set of correlated actions that satisfy
no regret for all players), then $(z_1, z_2)$ is a Nash equilibrium. Consequently, in zero-sum
games all dynamics that lead to no regret (not only those in class $\mathcal{R}$) converge to the
set of Nash equilibria. This holds more generally for stable bimatrix games [30], because
these are rescaled zero-sum games in the sense of Hofbauer and Sigmund [31], as is easily
shown and was known to Josef Hofbauer (private communication).

(b) For games with strictly dominated strategies, the global attractor of CFP is contained
in the face of the simplex with no weight on these strategies. Hence all no-regret dynamics
in class $\mathcal{R}$ converge to this face. Similarly, these dynamics converge to the unique Nash
equilibrium in strictly dominance solvable games.
Contrary to (a), this need not be true for all dynamics that lead to no regret. Indeed,
convergence to the Hannan set or even to the reduced Hannan set does not guarantee
elimination of strictly dominated strategies. Consider, for instance, the games shown in
Fig. 3. Both games are symmetric, so we indicate only the payoffs of the row player. Game
(i) is an identical interest game which is strictly dominance solvable; yet the correlated
action putting probability 1/3 on each diagonal square is in the reduced Hannan set.
For $\varepsilon = 0$, game (ii) is a coordination game with duplicate strategies. For $\varepsilon > 0$, the
duplicates $A^-$, $B^-$ are penalized and become strictly dominated. Thus, the correlated
action putting probability 1/2 on $(A^-, A^-)$ and 1/2 on $(B^-, B^-)$ puts weight only on
strictly dominated actions. Yet, for $\varepsilon < 1/2$, it belongs to the Hannan set.^11

^10 Note that some applications of Theorem 1 (points (a), (b) and (c) below) lead to the same
conclusions about no-regret dynamics as those about the time average of the replicator dynamics described in
Hofbauer et al. [33, p. 267, points (2), (3) and (4)].

^11 See also the game of Moulin and Vial [43, p. 205], where the third strategy of player 1 is strictly
dominated but has a positive marginal probability under some correlated actions in the Hannan set.

(c) In weighted potential games, all internally chain transitive sets of CFP are (subsets
of) connected components of Nash equilibria on which the payoffs are constant [see 5,
Theorem 5.5 and Remark 5.6]. Hence by Theorem 1, all no-regret dynamics in class $\mathcal{R}$
converge to such components. Note that the original proof is much longer [28, Appendix
A].

(d) If the beliefs $(z_1(t), z_2(t))$ of a no-regret dynamics converge to the set of Nash
equilibria, then the average realized payoff converges to the set of Nash equilibrium payoffs.
To see why this is true, let $z \in \Delta(A)$ be a limit point of $\{z(t)\}$ and let the marginals
$(z_1, z_2) \in \Delta(A_1) \times \Delta(A_2)$ constitute a Nash equilibrium. By Corollary 1 the maximal
regret converges to zero, so for every $i = 1, 2$

$$u_i(z) = \max_{k \in A_i} u_i(k, z_{-i}) = u_i(z_i, z_{-i}).$$

This result illuminates an important difference between no-regret dynamics and discrete
fictitious play. It is well known that under DFP, if the beliefs of the players converge
to a Nash equilibrium, their average realized payoffs need not approach the set of Nash
equilibrium payoffs, whereas under no-regret dynamics it is always the case.

(e) Consider the 3x3 game of Fig. 4 due to Shapley [45], the historical counterexample
to the convergence of fictitious play. This game has a unique equilibrium, in which
both players randomize uniformly. Though this equilibrium attracts some solutions of
continuous fictitious play (e.g., all those that start and remain symmetric), almost all
solutions converge to a hexagon, the so-called Shapley polygon [23, 45, 46]. It may
be shown that the only ICT sets are the Nash equilibrium and the Shapley polygon.
Consequently, the limit set of any no-regret dynamics in class $\mathcal{R}$ is almost surely one of
these two sets.

(f) In a number of classes of games, convergence of discrete fictitious play to the set of
Nash equilibria has been established, but analogous results for continuous fictitious play
are lacking. Thus we cannot use Theorem 1. These classes of games include generic 2 x n
games [7], generic ordinal potential games, quasi-supermodular games with diminishing
returns [8], and some other special classes (see, e.g., Sparrow et al. [46, p. 260]). For
ordinal potential games and quasi-supermodular games with diminishing returns, Berger [8]
proves convergence to the set of Nash equilibria of some solutions of continuous fictitious
play as defined by (4) (see our footnote 6). This is not enough to apply the results of
Benaïm et al. [5]. The same problem arises in Krishna and Sjöström [36]. Actually, as
explained below, convergence of CFP to the set of Nash equilibria would not suffice to
use Theorem 1: we would need some additional structure, such as a Lyapunov function,
to get more information on the ICT sets.

(g) Consider a bimatrix game in which all solutions of CFP converge to the set of Nash 
equilibria. Because the definition of attractor requires uniform attraction, this does not 
imply that the set of Nash equilibria is an attractor. Neither does it imply that all ICT 
sets are contained in the set of Nash equilibria, as shown in Appendix A.2. Therefore, we
cannot deduce from Theorem 1 that no-regret dynamics in class $\mathcal{R}$ converge to the set of
Nash equilibria; whether this is always the case remains an open question.

(h) We show in Section 5 that Theorem 1 also applies, and under weaker assumptions,
to the continuous-time version and to the expected version of no-regret dynamics in class
$\mathcal{R}$. As apparent from the proof, the existence of a potential is not essential: for a good
choice of behavior when there are no regrets, Theorem 1 holds for any no-regret dynamics
such that a player always chooses an action with positive regret whenever she has one.
It also applies to certain no-regret dynamics that do not have this property, such as the
exponential weight algorithm (see Remark 6 at the end of Appendix A.1).

(i) Let us now comment on extensions of our results to n-player games. The definition of
no-regret dynamics, as well as Proposition 1, extend to the n-player setting straightforwardly
(e.g., Hart and Mas-Colell [27]). The appropriate extension of CFP is correlated
CFP, where at each time t every player chooses a best-reply action to the correlated past
average play of the others. Specifically, an absolutely continuous function
$z : [1, \infty) \to \Delta(A)$ is a solution of correlated CFP if it is almost everywhere differentiable
and satisfies

\[ \dot z(t) \in \frac{1}{t}\bigl(BR(z(t)) - z(t)\bigr), \]

where the correlated best-reply correspondence $BR : \Delta(A) \rightrightarrows \Delta(A)$ is defined by
$BR(z) = \times_{i=1}^{n} BR_i(z_{-i})$, where $BR_i(z_{-i})$ is the set of mixed best replies of player i to
the correlated action $z_{-i} \in \Delta(A_{-i})$.

Footnote (quasi-supermodular games): Also known as games of strategic complementarities (e.g., Tirole [48]).

In n-player games with linear incentives 4J], also known as polymatrix games [50], the
correlated and independent best-reply correspondences coincide; that is, for any correlated
action $z \in \Delta(A)$, $BR(z) = BR((z_1, \ldots, z_n))$, where $(z_1, \ldots, z_n)$ is the vector of marginals
of z and BR on the right-hand side is the standard (independent) best-reply correspondence.
For such games, Theorem 1 extends easily. However, this is not the case in general. The main
problem is that the correlated best-reply correspondence is not convex valued; that is, $BR(z)$
is not in general a convex subset of $\Delta(A)$. This creates two issues:

(i) Existence of solutions of correlated CFP is not guaranteed by the classical results
on differential inclusions we are aware of (e.g., Aubin and Cellina [1]).

(ii) The theory of perturbed differential inclusions does not apply to non-convex
valued correspondences.

The first issue can be solved by building piecewise linear solutions of correlated CFP
following the same ideas as for two-player games (see Hofbauer). Moreover, due to
Remark 3, Proposition 2 extends to the n-player setting. It then asserts that correlated
CFP leads to no regrets. Lemma 1 also extends: it asserts that if the maximal regret of
player i is less than ε, then she plays only ε-best reply actions to the correlated average
play of the opponents. It follows that, analogously to two-player games, interpolated
trajectories of no-regret dynamics are almost surely perturbed solutions of correlated
CFP. However, we cannot proceed to an analog of Theorem 1 because of the second issue.
Thus, whether there is a formal relation between no-regret dynamics and correlated CFP
in n-player games remains an open question. Similarly, the results of Hofbauer et al.
on the links between the time-average of the replicator dynamics and CFP are restricted
to bimatrix games (or games with linear incentives).



Footnote: This is due to the fact that elements of $BR(z)$ are independent distributions and that the
average of two independent distributions need not be an independent distribution.

Footnote: Assuming that $z(t)$ is well defined, call $G_r(t)$ the game in which the players are reduced to
their best replies to $z(t)$. Start with some initial condition $z(t_0)$. Then point to a Nash equilibrium
of $G_r(t_0)$ (i.e. fix $b \in NE(G_r(t_0))$ and choose $q(t) = b$) till the first time, $t_1$, when, for some
player i, a strategy which was not a best reply to $z(t_0)$ is a best reply to $z(t_1)$. Then iterate. If
the times $t_n$ accumulate towards some time $t^*$, then use the fact that $z(t)$ must have a limit when
$t \to t^*$ (because $z(t)$ is Lipschitz). Call it $z(t^*)$ and restart from $z(t^*)$. Note that there might in
principle be a countable infinity of such accumulation points $t^*$, and that they might themselves
accumulate in some point $t^{**}$, but then define $z^{**}$ as before and restart from there, etc. The largest
(forward time) interval on which such a solution can be built is both open and closed in
$[t_0, +\infty)$ and is thus equal to $[t_0, +\infty)$.
4 Curb sets



Theorem 1 does not answer whether attracting sets of CFP have an analogous property
under no-regret dynamics.

A set $C \subset \Delta(A)$ is eventually attracting under a no-regret dynamic process if with any
given probability it captures all no-regret trajectories originating from a small enough
neighborhood of C at all distant enough periods. Formally, C is eventually attracting if
for every $\pi > 0$ there exist $\varepsilon > 0$ and a period T such that: for every $t_0 \ge T$, if $z(t_0)$ is
in an ε-neighborhood of C, then $z(t)$ converges to set C with probability at least $1 - \pi$.

For this section it is convenient to replace assumption (Q2) by the following one: 

If a player's maximal regret is nonpositive, then she plays a best-reply (Q2') 
to the empirical distribution of her opponent. 

This is not essential, since the interesting histories are those where both players have
positive regrets, in which case (Q2) plays no role.

A strict Nash equilibrium is eventually attracting. Indeed, if $z(t_0)$ is close enough to
a vertex of $\Delta(A)$ corresponding to a strict Nash equilibrium $a = (a_1, a_2)$, then for each
player i, action $a_i$ is the unique best reply and there is a negative regret for any action
other than $a_i$. Since by (R3) only actions with positive regret can be chosen, and by
(Q2') only best-reply actions can be chosen if all regrets are nonpositive, action $a_i$ will be
played by each player i in the following period, and so on.

Let us now consider a standard generalization of strict Nash equilibria. For each
$i = 1, 2$, let $B_i \subset A_i$. With a slight abuse of notation, denote by $\Delta(B_i)$ the set of
probability measures on $A_i$ with support on $B_i$ only. The product set $B = B_1 \times B_2$ is
closed under rational behavior (curb) (Basu and Weibull [3]) if

\[ BR_i(x_{-i}) \subset \Delta(B_i) \quad \text{whenever } x_{-i} \in \Delta(B_{-i}), \quad i = 1, 2. \]

That is, the set B is curb if the players' pure best reply profiles are contained in B
whenever they believe that no actions outside of B should be played.
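
As a minimal illustration of the curb condition, consider the singleton case $B = \{a_1\} \times \{a_2\}$: then $\Delta(B_{-i})$ contains only $a_{-i}$, so B is curb exactly when each $a_i$ is the unique best reply to $a_{-i}$, that is, when a is a strict Nash equilibrium. The sketch below checks this special case only; the function names and the coordination game are illustrative assumptions, and checking a general curb set would require examining best replies to all mixed beliefs on $B_{-i}$.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def pure_best_replies(payoffs: List[float], tol: float = 1e-12) -> List[int]:
    """Indices of the actions whose payoff is within `tol` of the maximum."""
    best = max(payoffs)
    return [k for k, v in enumerate(payoffs) if v >= best - tol]


def is_singleton_curb(u1: Matrix, u2: Matrix, a: Tuple[int, int]) -> bool:
    """True iff {a1} x {a2} is curb, i.e. a is a strict Nash equilibrium.

    u1[r][c] and u2[r][c] are the payoffs of the row and column player.
    """
    a1, a2 = a
    br1 = pure_best_replies([u1[k][a2] for k in range(len(u1))])
    br2 = pure_best_replies([u2[a1][k] for k in range(len(u2[a1]))])
    return br1 == [a1] and br2 == [a2]


# Hypothetical 2x2 coordination game with two strict equilibria (an assumption).
COORD: Matrix = [[2.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    for profile in [(0, 0), (1, 1), (0, 1)]:
        print(profile, is_singleton_curb(COORD, COORD, profile))
```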

Curb sets are known to be attracting under CFP (e.g., Balkenborg et al. [2], Lemma
7]). However, they need not be attracting under no-regret dynamics in class $\mathcal{R}$. Indeed,
even if the support of $z(t_0)$ is contained in some curb set B, there may be positive regrets
for actions outside of B, since B need not be closed under better replies. However, we
show that the intersection of the Hannan set and the set of correlated actions with support
on a curb set is eventually attracting.

Footnote: We say that $z(t)$ converges to C if $\inf_{c \in C} \|z(t) - c\| \to 0$ as $t \to \infty$.

Footnote: Recall that by Remark 1 if a player has positive maximal regret, then it remains positive
forever. So we can consider histories from a distant enough period $t_0$ where both players have
positive regrets and (Q2) plays no role. If $t_0$ does not exist, i.e., some player always has nonpositive
maximal regret, then Proposition 1 and (Q2) imply that her play is constant, whereas her opponent's
play must approach a best reply to it, leading to Nash equilibrium. By replacing (Q2) by (Q2') we
avoid dealing with this issue.

Formally, let $B = B_1 \times B_2$ be a curb set. Let $\Delta_B(A)$ denote the set of correlated
actions with support on B only. Let $H_B = H \cap \Delta_B(A)$.

Proposition 3. For every curb set B, the set $H_B$ is eventually attracting under every
no-regret dynamics in $\mathcal{R}$.

The proof is based on the following observations. For every curb set B, if the average
play is close enough to $H_B$, then regrets for all actions outside of B are negative (since
B is curb). Hence, by condition (R3), only actions in B will be played in the immediate
future. On the other hand, almost sure convergence of maximum regret to zero suggests
that, so long as the players choose only actions in B, the average play will approach $H_B$,
thus reinforcing the former observation. To prove the result, however, we need to establish
bounds on the maximal future regret conditional on certain histories (namely, conditional
on being close to $H_B$) that Hart and Mas-Colell [27] do not provide. The complete proof
is relegated to Appendix A.3.



5 Continuous-time and expected no-regret dynamics 



We now prove an analog of Theorem 1 for continuous-time dynamics [28] and the expected
version of discrete-time dynamics. Both describe trajectories of average intended (mixed)
play, rather than average realized (pure) play. For this reason, condition (R3) is not
needed. Indeed, the interest of (R3) is that, together with (Q1), it requires every realized
action to be a better reply to the opponent's empirical distribution of play (whenever such
actions exist). But now we only need every mixed (expected) action to be a better reply,
and this follows already from conditions (R1)-(R2) and (Q1). Besides, these dynamics
are deterministic, hence the results we obtain hold surely (not just almost surely). The
proofs are based on Appendix A.1 and are best understood after reading it.
Consider a continuous-time dynamics

\[ \dot z(t) = \frac{1}{t}\bigl(q(t) - z(t)\bigr) \qquad (6) \]

where $q(t) = q_1(t) \otimes q_2(t) \in \Delta(A)$ is the (independent) joint play at time t and $z(t)$ the
average correlated play. There are two differences with (1): time is now continuous, and,
more importantly, realized play $a(t)$ has been replaced by intended mixed play $q(t)$. As
in CFP, start at time 1 with some initial condition $z(1) \in \Delta(A)$. Assume that whenever
$R_{i,\max}(t) > 0$,

\[ q_{i,k}(t) = \frac{\nabla_k P_i(R_i(t))}{\sum_{s \in A_i} \nabla_s P_i(R_i(t))} \qquad (7) \]

where $P_i$ is a potential function satisfying (R1), (R2) and the technical condition:
(P4') There exists $0 < \rho_2 < \infty$ such that $\nabla P_i(x) \cdot x \le \rho_2 P_i(x)$ for all $x \in \mathbb{R}^{A_i}$.
This is a part of condition (P4) in Hart and Mas-Colell [28].



Proposition 4. Let $z(t)$ be a solution of (6) and (7) with $P_i$ satisfying conditions (R1),
(R2) and (P4') for all $i = 1, 2$. Assume that the initial condition $z(1)$ is such that both
players have some positive regrets: $R_{i,\max}(1) > 0$ for all $i = 1, 2$. Then the limit set of the
beliefs is internally chain transitive for continuous fictitious play.



Proof. Let $\varepsilon_i(t) := R_{i,\max}(t)$. Hart and Mas-Colell [28], Theorem 3.1 and Lemma 3.1
show that if $\varepsilon_i(1) > 0$, then $\varepsilon_i(t) > 0$ for all t, and $\varepsilon_i(t) \to 0$ as $t \to +\infty$. Moreover, by
(R2) applied to $x = R_i(t)$ and definition of $q_i$, we have: $u_i(q_i, z_{-i}) - u_i(z) = q_i \cdot R_i > 0$ (this
is equation (3.3) in [27]). Thus by Lemma 1, $q_i \in BR_i^{\varepsilon_i(t)}(z_{-i})$. Together with Lemma
3 in Appendix A.1, this implies that $(z_1(\cdot), z_2(\cdot))$ is a perturbed solution of CFP in the
sense of Benaïm et al. [5]. The result then follows from Theorem 3.6 of Benaïm et al. [5].



Remark 5. Assume that if all initial regrets of a player are nonpositive then the dynamics
is defined as in Hart and Mas-Colell [28], equation (4.9). Then it is easily seen that the
result of Proposition 4 holds for any initial condition $z(1)$.

Expected discrete-time dynamics. The expected motion in (1) is described by

\[ z(t) - z(t-1) = \frac{1}{t}\bigl(q(t) - z(t-1)\bigr), \]

where $q(t) = q_1(t) \otimes q_2(t)$ is the expectation of $a(t)$. Assume that $q_i$ is derived by (Q1)
from a potential function satisfying (R1)-(R2). Let $\varepsilon_i(t) := R_{i,\max}(t)$. It is easily seen
that, as for continuous-time dynamics, $\varepsilon_i(t) \to 0$ as $t \to +\infty$, and if $\varepsilon_i(1) > 0$, then for
all t, $\varepsilon_i(t) > 0$ and $q_i \in BR_i^{\varepsilon_i(t)}(z_{-i})$. Due to Lemmata 3 to 5 of Appendix A.1 and to
Theorem 3.6 of Benaïm et al. [5], it follows that for a good choice of behavior when all
regrets are initially nonpositive, the limit set of the beliefs is internally chain transitive
for CFP.

Footnote: Note a typo in the proof of Lemma 3.3 in Hart and Mas-Colell [28]: (P3) should be replaced
by (P4). Moreover, only our condition (P4') is used in the proof of Lemma 3.3 in [28].
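
The expected discrete-time recursion is deterministic and easy to iterate. The following sketch is an illustration under stated assumptions, not the construction used in the proofs: it takes the quadratic potential $P_i(x) = \tfrac12 \sum_k (x_k^+)^2$, so that $q_i$ is proportional to the positive part of the regret vector, it lets a player with no positive regret repeat her own marginal (a hypothetical stand-in for (Q2)), and it uses matching pennies as a test game.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def regret_vectors(u1: Matrix, u2: Matrix, z: Matrix) -> Tuple[Vector, Vector, Vector, Vector]:
    """Return (R_1(z), R_2(z), z_1, z_2) for the correlated action z."""
    n1, n2 = len(z), len(z[0])
    z1 = [sum(z[a]) for a in range(n1)]                         # marginal of player 1
    z2 = [sum(z[a][b] for a in range(n1)) for b in range(n2)]   # marginal of player 2
    v1 = sum(z[a][b] * u1[a][b] for a in range(n1) for b in range(n2))
    v2 = sum(z[a][b] * u2[a][b] for a in range(n1) for b in range(n2))
    r1 = [sum(z2[b] * u1[k][b] for b in range(n2)) - v1 for k in range(n1)]
    r2 = [sum(z1[a] * u2[a][k] for a in range(n1)) - v2 for k in range(n2)]
    return r1, r2, z1, z2


def regret_matching(r: Vector, fallback: Vector) -> Vector:
    """Mixed action proportional to positive regrets; `fallback` if there are none."""
    pos = [max(x, 0.0) for x in r]
    total = sum(pos)
    return [p / total for p in pos] if total > 0.0 else list(fallback)


def expected_dynamics(u1: Matrix, u2: Matrix, z_init: Matrix, steps: int) -> Matrix:
    """Iterate z(t) = z(t-1) + (q(t) - z(t-1)) / t for t = 2, ..., steps."""
    z = [row[:] for row in z_init]
    n1, n2 = len(z), len(z[0])
    for t in range(2, steps + 1):
        r1, r2, z1, z2 = regret_vectors(u1, u2, z)
        q1 = regret_matching(r1, z1)
        q2 = regret_matching(r2, z2)
        for a in range(n1):
            for b in range(n2):
                z[a][b] += (q1[a] * q2[b] - z[a][b]) / t
    return z


# Hypothetical test game: matching pennies, started from a pure action profile.
U1: Matrix = [[1.0, -1.0], [-1.0, 1.0]]
U2: Matrix = [[-1.0, 1.0], [1.0, -1.0]]

if __name__ == "__main__":
    z_final = expected_dynamics(U1, U2, [[1.0, 0.0], [0.0, 0.0]], 5000)
    r1, r2, _, _ = regret_vectors(U1, U2, z_final)
    print("maximal regrets:", max(r1), max(r2))
```

With this potential the positive part of the regret vector shrinks at rate $O(1/\sqrt{t})$ by the usual approachability argument, which is why the maximal regrets printed after a few thousand steps are small.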



Appendix 

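A.1 Proof of Theorem 1

Denote by $\widetilde{BR}_i^{\varepsilon}$ the correspondence whose graph is the ε-neighborhood of the graph
of $BR_i$:

\[ \widetilde{BR}_i^{\varepsilon}(x_{-i}) = \Bigl\{ x_i \in \Delta(A_i) : \exists (x_i^*, x_{-i}^*) \in \Delta(A_1) \times \Delta(A_2) \text{ s.t. } x_i^* \in BR_i(x_{-i}^*) \text{ and } \|(x_i^*, x_{-i}^*) - (x_i, x_{-i})\|_\infty < \varepsilon \Bigr\}. \]

Let $\widetilde{BR}^{\varepsilon}(x) = \widetilde{BR}_1^{\varepsilon}(x_2) \times \widetilde{BR}_2^{\varepsilon}(x_1)$. In words, action $x_i$ is an ε-graph perturbed best reply
to $x_{-i}$ if there is an action ε-close to $x_i$ which is an exact best reply to an action ε-close
to $x_{-i}$. This is the notion of perturbation used in the theory of perturbed differential
inclusions (Benaïm et al. [5]). As illustrated by the example below, it is different from
the notion of perturbation of payoffs in the ε-best reply correspondence, i.e.
$BR^{\varepsilon}(x) = BR_1^{\varepsilon}(x_2) \times BR_2^{\varepsilon}(x_1)$ with

\[ BR_i^{\varepsilon}(x_{-i}) = \Bigl\{ x_i \in \Delta(A_i) : u_i(x_i, x_{-i}) \ge \max_{k \in A_i} u_i(k, x_{-i}) - \varepsilon \Bigr\}, \quad i = 1, 2. \]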
Fig. 5

Consider a game where the payoffs of player 1 are given by Fig. 5. Let $\varepsilon \in (0, 1/2)$
and let $x_2^{\varepsilon} = (\tfrac12 + \varepsilon) L + (\tfrac12 - \varepsilon) R$. The pure action C is a $2\varepsilon$-best reply to $x_2^{\varepsilon}$. Using
the sup norm, it is at distance 1 from pure action T, the unique exact best reply to $x_2^{\varepsilon}$.
Nevertheless, C is an ε-graph perturbed best reply, because it is an exact best reply to $x_2$,
which is ε-close (in sup norm) to $x_2^{\varepsilon}$. By contrast, for all $\eta > 0$, action B is an $(\varepsilon + \eta)$-best
reply, but only a 1-graph perturbed best reply to $x_2^{\varepsilon}$.
A discrete-time trajectory $(x_1(t), x_2(t))_{t \ge 1}$ on $\Delta(A_1) \times \Delta(A_2)$ is a payoff perturbed
fictitious play trajectory if there exists a positive sequence $(\varepsilon_t)$ converging to zero such
that

\[ x(t) - x(t-1) = \frac{1}{t}\bigl(q(t) - x(t-1)\bigr) \]

with $q(t) = (q_1(t), q_2(t))$ and $q_i(t) \in BR_i^{\varepsilon_t}(x_{-i}(t-1))$ for all $i = 1, 2$ and all $t > 1$.
It is a graph perturbed fictitious play trajectory if the same holds but replacing $BR_i^{\varepsilon_t}$
with $\widetilde{BR}_i^{\varepsilon_t}$. A trajectory $(z(t))_{t \ge 1}$ on $\Delta(A)$ generates a sequence of beliefs $(z_1(t), z_2(t))$ in
$\Delta(A_1) \times \Delta(A_2)$.

The proof goes as follows. Lemma 2 shows that the sequence of beliefs generated
by a no-regret dynamics is a payoff perturbed FP trajectory. Together with Lemma 3,
this implies that it is a graph perturbed FP trajectory (Lemma 4). It follows that the
interpolated process of a no-regret dynamics trajectory is a perturbed solution of CFP
(Lemma 5). The result then follows from Benaïm et al. [5].

Lemma 2. The sequence of beliefs of a solution of a no-regret dynamics in class $\mathcal{R}$ is
almost surely a payoff perturbed DFP trajectory.

Proof. If $R_{i,\max}(t) \le 0$ for all t, then by Remark 2 player i plays an $\varepsilon(t)$-best reply for
some $\varepsilon(t)$ converging to zero. Otherwise, $R_{i,\max}(t_0) > 0$ for some $t_0 \in \mathbb{N}^*$. Then for all
times $t \ge t_0$, $R_{i,\max}(t) > 0$ (by Remark 1) and player i plays an $R_{i,\max}(t)$-best reply by
Lemma 1 and conditions (R3) and (Q1). Since $R_{i,\max}(t) \to 0$ almost surely, the result
follows. ■

Lemma 3. Let X be a compact subset of $\mathbb{R}^m$ and F a correspondence from X to itself. For
any $\delta > 0$, let $F_\delta : X \rightrightarrows X$ denote the correspondence whose graph is the δ-neighborhood
of the graph of F:

\[ F_\delta(x) = \bigl\{ y \in X : \exists (x^*, y^*) \in X^2 \text{ s.t. } y^* \in F(x^*) \text{ and } \|(x^*, y^*) - (x, y)\|_\infty < \delta \bigr\}. \]

For any $a > 0$, let $G_a$ be an u.s.c. correspondence from X to itself. Assume that for each
x in X:

(i) $a < a' \Rightarrow G_a(x) \subset G_{a'}(x)$ (that is, $(G_a)_{a > 0}$ is increasing w.r.t. inclusion);

(ii) $\bigcap_{a > 0} G_a(x) \subset F(x)$.

Then for every $\delta > 0$ there exists $a > 0$ such that for each x in X, $G_a(x) \subset F_\delta(x)$.

Proof. By contradiction, assume that there exists $\delta > 0$, a decreasing sequence $(a_n)$ converging
to zero, and sequences $(x_n)$ and $(y_n)$ of points in X such that $y_n \in G_{a_n}(x_n) \setminus F_\delta(x_n)$
for all n. By compactness of X, we can assume that $(x_n)$ and $(y_n)$ converge respectively
to $x^*$ and $y^*$. Fix $k \in \mathbb{N}$. For all $n \ge k$, $y_n \in G_{a_n}(x_n) \subset G_{a_k}(x_n)$ by (i). Since $G_{a_k}$ is
u.s.c., it follows that $y^* \in G_{a_k}(x^*)$. Therefore, by (i) and (ii),

\[ y^* \in \bigcap_{k \in \mathbb{N}} G_{a_k}(x^*) = \bigcap_{a > 0} G_a(x^*) \subset F(x^*). \]

But for n large enough, $\|(x^*, y^*) - (x_n, y_n)\|_\infty < \delta$, hence $y_n \in F_\delta(x_n)$, a contradiction. ■

Applied to the best-reply correspondence, Lemma 3 implies that for any $\delta > 0$, an
ε-best reply is a δ-graph perturbed best reply, provided ε is small enough. Thus
we have the next result.

Lemma 4. Any payoff perturbed DFP trajectory is a graph perturbed DFP trajectory.

Proof. Let $\varepsilon_t \to 0$. Let

\[ \delta_t = \min\bigl\{ \delta \ge 0 : \forall i = 1, 2,\ \forall x \in \Delta(A_1) \times \Delta(A_2),\ BR_i^{\varepsilon_t}(x_{-i}) \subset \widetilde{BR}_i^{\delta}(x_{-i}) \bigr\}. \]

Applying Lemma 3 with $X = \Delta(A_1) \times \Delta(A_2)$, $G_a = BR^{a}$ and $F = BR$, we obtain that
$\delta_t \to 0$. The result follows. ■

Given a discrete-time trajectory $x(n) = (x_1(n), x_2(n))$ on $\Delta(A_1) \times \Delta(A_2)$, with
$n \in \mathbb{N}^*$, define its interpolated process $x : [1, +\infty) \to \Delta(A_1) \times \Delta(A_2)$ as follows. For all
$t \in [n, n+1)$ let $t\,x(t) = n\,x(n) + (t - n)\,q(n)$, where $q_i(n) = (n+1)\,x_i(n+1) - n\,x_i(n)$,
$i = 1, 2$. This is equivalent to

\[ x_i(t) - x_i(n) = \frac{t - n}{n}\bigl(q_i(n) - x_i(t)\bigr), \quad i = 1, 2. \]

Hence for all $t \in (n, n+1)$ we have $\|x(t) - x(n)\|_\infty < 1/n$ and

\[ \dot x(t) = \frac{1}{t}\bigl(q(n) - x(t)\bigr). \qquad (8) \]
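
The interpolation can be checked numerically. The sketch below assumes the rule $t\,x(t) = n\,x(n) + (t - n)\,q(n)$ with $q(n) = (n+1)\,x(n+1) - n\,x(n)$ and treats a single player's belief vector; the function name `interpolate` and the sample beliefs are illustrative assumptions. The interpolated path passes through $x(n)$ and $x(n+1)$ at the integer times and its slope matches the right-hand side of (8).

```python
from typing import List


def interpolate(x_n: List[float], x_next: List[float], n: int, t: float) -> List[float]:
    """Interpolated belief x(t) for t in [n, n+1].

    Uses t * x(t) = n * x(n) + (t - n) * q(n), where
    q(n) = (n + 1) * x(n + 1) - n * x(n) is the action played between n and n + 1.
    """
    q = [(n + 1) * b - n * a for a, b in zip(x_n, x_next)]
    return [(n * a + (t - n) * qk) / t for a, qk in zip(x_n, q)]


if __name__ == "__main__":
    x3 = [0.5, 0.5]       # hypothetical belief after 3 periods
    x4 = [0.625, 0.375]   # belief after a 4th period in which the first action was played
    for t in (3.0, 3.5, 4.0):
        print(t, interpolate(x3, x4, 3, t))
```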

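An absolutely continuous function $x : [1, +\infty) \to \Delta(A_1) \times \Delta(A_2)$ is a perturbed solution
of CFP if there exists a vanishing function $\varepsilon : \mathbb{R}_+ \to \mathbb{R}_+$ such that for almost all t,

\[ \dot x \in \frac{1}{t}\Bigl[\widetilde{BR}^{\varepsilon(t)}(x) - x\Bigr] \quad \text{where } x = x(t). \qquad (9) \]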

Lemma 5. The interpolated process of a graph perturbed DFP trajectory is a perturbed
solution of CFP.

Proof. Consider a discrete-time trajectory $(x_1(n), x_2(n))_{n \in \mathbb{N}}$ such that

\[ x_i(n) - x_i(n-1) = \frac{1}{n}\bigl(q_i(n) - x_i(n-1)\bigr), \quad i = 1, 2, \]

with $q_i(n) \in \widetilde{BR}_i^{\varepsilon_n}(x_{-i}(n-1))$ and $\varepsilon_n \to 0$. For all n and all $t \in [n, n+1)$, let
$\varepsilon(t) = \varepsilon_n + 2/n$. Obviously, $\varepsilon(t) \to 0$ as $t \to \infty$. Moreover, for all $t \in (n, n+1)$, the
interpolated process satisfies $\|x_{-i}(t) - x_{-i}(n-1)\|_\infty < \frac{1}{n} + \frac{1}{n} = \frac{2}{n}$, so
$q_i(n) \in \widetilde{BR}_i^{\varepsilon(t)}(x_{-i}(t))$. Therefore (8) implies (9) (see also Faure and Roth
[18, Proposition 2.2]). ■

We can now prove Theorem 1. By Lemmata 2 and 4, the sequence of beliefs of a
solution of a no-regret dynamics in class $\mathcal{R}$ is almost surely a graph perturbed DFP
trajectory. Hence, by Lemma 5, its interpolated process $x(t)$ is a perturbed solution
of CFP. This implies that $x(e^t)$ is almost surely a perturbed solution of the best-reply
dynamics, in the sense of Benaïm et al. [5, Definition II]. Theorem 1 now follows from
Theorem 3.6 of Benaïm et al. [5].

Remark 6. Assume that at stage t, for each $i = 1, 2$, player i chooses a pure action
according to a mixed action $q_i(t)$ that depends on the previous history $h(t-1)$. Do not
assume conditions (R1)-(R3) and (Q1), but assume that there exists a vanishing sequence
$(\varepsilon_t)$ such that for all $t > 1$ and any previous history $h(t-1)$, $q_i(t) \in BR_i^{\varepsilon_t}(z_{-i}(t-1))$,
$i = 1, 2$. Then it follows from Lemma 3, the above proof and Benaïm et al. [5, Proposition
1.4 and a variant of Proposition 1.3] that Theorem 1 applies. As is well known, this is
the case for the exponential weights algorithm [21], [39] that corresponds to

\[ q_{i,k}(t) = \frac{\exp\bigl(\beta_t\, u_i(k, z_{-i})\bigr)}{\sum_{s \in A_i} \exp\bigl(\beta_t\, u_i(s, z_{-i})\bigr)} \]

with $z_{-i} = z_{-i}(t-1)$, $\beta_t \to +\infty$ as $t \to \infty$, and $\beta_t \le t^{a}$ for some $a \in (0, 1)$ to ensure that
this is a no-regret dynamics (see, e.g., Benaïm and Faure [4]). The above assumptions
are not (or not trivially) satisfied by no-regret dynamics in class $\mathcal{R}$. Indeed, the rate at
which the maximal regret vanishes, hence the value $\varepsilon_t$ such that $q_i(t) \in BR_i^{\varepsilon_t}(z_{-i}(t))$, may
depend on the trajectory.
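
The exponential weights rule of Remark 6 is a softmax over the payoffs against the opponent's empirical play. The sketch below is only illustrative: the function names, the payoff vector and the schedule $\beta_t = t^{a}$ are assumptions. It shows that the payoff shortfall of the resulting mixed action, i.e. the smallest ε for which it is an ε-best reply, shrinks as $\beta_t$ grows.

```python
import math
from typing import List


def exponential_weights(payoffs: List[float], beta: float) -> List[float]:
    """Mixed action with q_k proportional to exp(beta * payoffs[k]).

    The maximum payoff is subtracted before exponentiating for numerical stability;
    this does not change the resulting probabilities.
    """
    top = max(payoffs)
    weights = [math.exp(beta * (p - top)) for p in payoffs]
    total = sum(weights)
    return [w / total for w in weights]


def shortfall(payoffs: List[float], q: List[float]) -> float:
    """max_k payoffs[k] - q . payoffs: q is an eps-best reply for any eps at least this large."""
    return max(payoffs) - sum(qk * p for qk, p in zip(q, payoffs))


def beta_schedule(t: int, a: float = 0.5) -> float:
    """Hypothetical schedule beta_t = t**a with a in (0, 1)."""
    return float(t) ** a


if __name__ == "__main__":
    payoffs = [1.0, 0.5, 0.0]  # hypothetical values of u_i(k, z_{-i}(t-1))
    for t in (1, 100, 10000):
        q = exponential_weights(payoffs, beta_schedule(t))
        print(t, q, shortfall(payoffs, q))
```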



A.2 ICT sets when all solutions converge to Nash equilibria

The fact that all solutions of the best-reply dynamics converge to the set of Nash equilibria
does not guarantee that ICT sets contain only Nash equilibria. We provide counterexamples
below.

Footnote: The definition of perturbed solution in Benaïm et al. [5] is different from ours but equivalent.

Example 1 (single-population dynamics). Consider the following symmetric 3x3 game:

        A     B     C

   A    0     0     0
   B    0     0     0
   C   -1     0     0

Denote a mixed action by $x = (x_A, x_B, x_C)$. Then $(x, x)$ is a Nash equilibrium if and
only if $x_A = 0$ or $x_C = 0$. It is easily seen that all solutions of the best-reply dynamics
converge to the set of symmetric Nash equilibria. However, the whole state space is ICT.
Indeed, any mixed action x can be connected to any other mixed action y as follows:
starting from x, follow a solution pointing towards the edge $x_C = 0$, then jump on this
edge and follow a solution pointing towards the pure strategy B; once close to B, jump
on the edge $x_A = 0$, and follow a solution pointing towards C; once close to C, make a
small jump to reach a point from which a solution points toward y; follow this solution
and if needed (i.e. if $y_A = 0$), make one more jump to reach y.

This example is also valid for the replicator dynamics and any payoff monotone dynamics
in the sense of, e.g., Hofbauer and Weibull 3J]. The only difference is that traveling
from A to B and from B to C cannot be done by following solutions of the dynamics but
only through long sequences of jumps. Note also that in an inward cycling Rock-Paper-Scissors
game (see e.g., Hofbauer and Sigmund [31], or Weibull j^]), all solutions of the
replicator dynamics converge to one of the four rest points but the whole boundary of the
state space is ICT (for the replicator dynamics).

Example 2 (n-population dynamics). Similarly, in the bimatrix version of Example 1, all
solutions of the two-population best-reply dynamics converge to the set of Nash equilibria
but the whole state space is ICT. Again, this is true for all payoff monotone dynamics.
Similar examples can be given for n-population dynamics for any $n > 1$.

At least for the best-reply dynamics, 2x2 examples can also be given. Consider, for
instance, the 2x2 game:

        L       R

   T   0,0     0,0
   B   0,0    -1,0

Denote mixed actions of players 1 and 2 by $x = (x_T, x_B)$ and $y = (y_L, y_R)$, respectively.
The set of Nash equilibria is the union of the edges $x_B = 0$ and $y_R = 0$, and all solutions of
the two-population best-reply dynamics converge to this set. However, the whole triangle
$x_T + y_L \ge 1$ is ICT.
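
The description of the Nash equilibrium set in this 2x2 game can be verified by brute force. The sketch below is an illustrative check, with the function names and the grid size chosen here as assumptions: it tests on a grid of mixed profiles that a profile is a Nash equilibrium exactly when $x_B = 0$ or $y_R = 0$.

```python
from typing import List

Matrix = List[List[float]]

# Payoffs of the 2x2 game above: rows T, B and columns L, R.
U1: Matrix = [[0.0, 0.0], [0.0, -1.0]]
U2: Matrix = [[0.0, 0.0], [0.0, 0.0]]


def is_nash(u1: Matrix, u2: Matrix, x: List[float], y: List[float], tol: float = 1e-12) -> bool:
    """True iff every action played with positive probability is a best reply."""
    pay1 = [sum(u1[a][b] * y[b] for b in range(len(y))) for a in range(len(x))]
    pay2 = [sum(u2[a][b] * x[a] for a in range(len(x))) for b in range(len(y))]
    ok1 = all(x[a] <= tol or pay1[a] >= max(pay1) - tol for a in range(len(x)))
    ok2 = all(y[b] <= tol or pay2[b] >= max(pay2) - tol for b in range(len(y)))
    return ok1 and ok2


def grid_agrees(steps: int = 4) -> bool:
    """Compare is_nash with the condition x_B = 0 or y_R = 0 on a (steps+1)^2 grid."""
    for i in range(steps + 1):
        for j in range(steps + 1):
            x_t, y_l = i / steps, j / steps
            x, y = [x_t, 1.0 - x_t], [y_l, 1.0 - y_l]
            expected = (x[1] == 0.0) or (y[1] == 0.0)
            if is_nash(U1, U2, x, y) != expected:
                return False
    return True


if __name__ == "__main__":
    print(grid_agrees())
```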

In these examples, a direct analysis shows that all solutions of no-regret dynamics in
class $\mathcal{R}$ converge to the set of Nash equilibria. Thus we do not know whether, in general,
convergence of all solutions of CFP to the set of Nash equilibria entails convergence of
no-regret dynamics. The point is that this is not guaranteed by Theorem 1.



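A.3 Proof of Proposition 3

We need some notation. For $z \in \Delta(A)$ and $a \in A$, let $z_a$ denote the probability of a under
the correlated action z. Let $\mathcal{U}_\gamma(H_B)$ be the neighborhood of $H_B$ in which the total weight
on action profiles outside of B and the potential of each player are below γ:

\[ \mathcal{U}_\gamma(H_B) = \Bigl\{ z \in \Delta(A) : \sum_{a \in A \setminus B} z_a < \gamma \ \text{and}\ P_i(R_i(z)) < \gamma,\ i = 1, 2 \Bigr\}, \]

where $R_i(z)$ is the regret vector of player i: $R_i(z) = \bigl[u_i(s, z_{-i}) - u_i(z)\bigr]_{s \in A_i}$.
Let $B = B_1 \times B_2$ be a curb set. Let

\[ \delta_B = \min_{i = 1, 2}\ \min_{z_{-i} \in \Delta(B_{-i})} \Bigl\{ \max_{s \in A_i} u_i(s, z_{-i}) - \max_{k \in A_i \setminus B_i} u_i(k, z_{-i}) \Bigr\} \]

and note that $\delta_B > 0$, since B is curb. Now, consider a no-regret dynamics in $\mathcal{R}$ defined
by potentials $P_i$, $i = 1, 2$, with trajectory $(z(t))_{t \ge 1}$. Let $\rho_i(\gamma)$ be the smallest number such
that for all $z \in \Delta(A)$

\[ P_i(R_i(z)) \le \gamma \ \Longrightarrow\ R_{i,\max}(z) \le \rho_i(\gamma), \]

and let $\rho(\gamma) = \max\{\rho_1(\gamma), \rho_2(\gamma)\}$. Let $\gamma_B$ be the solution of

\[ (2U + \delta_B)\,\gamma_B + \rho(\gamma_B) - \delta_B = 0 \qquad (10) \]

where $U = \max_{i = 1, 2} \max_{a \in A} |u_i(a)|$ is a payoff bound. Since $\rho(\gamma)$ is weakly increasing in
γ and $\rho(0) = 0$, there exists a unique solution $\gamma_B$ of (10) and $\gamma_B > 0$.
Consider the following event $\mathcal{E}_t$:

\[ P_i(R_i(t + n)) < \gamma_B \ \text{for each } i = 1, 2 \text{ and all } n \in \mathbb{N}^*. \qquad (\mathcal{E}_t) \]

The statement of Proposition 3 is immediate by the following claims and Corollary 1.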
The statement of Proposition [3] is immediate by the following claims and Corollary [H 
Claim 1. If $z(t) \in \mathcal{U}_{\gamma_B}(H_B)$ and event $\mathcal{E}_t$ holds, then $a(t + n) \in B$ for all $n \in \mathbb{N}^*$.

Claim 2. For every $\pi \in (0, 1]$ and every $\gamma \in (0, \gamma_B)$ there exists $t_0$ such that for every
$t \ge t_0$, if $z(t) \in \mathcal{U}_\gamma(H_B)$, then event $\mathcal{E}_t$ holds with probability at least $1 - \pi$.

For the proof of Claim 1 we need the following lemma.

Lemma 6. For any t, if $z(t) \in \mathcal{U}_{\gamma_B}(H_B)$, then $a(t + 1) \in B$.

Proof. Let $z \in \Delta(A)$. If z is close enough to $\Delta_B(A)$, then
$\max_{s \in A_i} u_i(s, z_{-i}) - \max_{k \in A_i \setminus B_i} u_i(k, z_{-i}) > \delta_B/2$ for all $i = 1, 2$. If z is close enough to H, then
$\max_{s \in A_i} u_i(s, z_{-i}) - u_i(z) < \delta_B/2$. Thus in the neighborhood of $H_B$,
$\max_{k \in A_i \setminus B_i} u_i(k, z_{-i}) < u_i(z)$, hence $R_{i,k}(z) < 0$ for all $k \in A_i \setminus B_i$. In particular, this
holds if $z \in \mathcal{U}_{\gamma_B}(H_B)$ (we omit the proof: easy but lengthy). It follows that if
$z(t) \in \mathcal{U}_{\gamma_B}(H_B)$, then by conditions (R3) and (Q2'), $a_i(t + 1) \in B_i$ for each $i = 1, 2$. ■

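Proof of Claim 1. Suppose that $\mathcal{E}_t$ holds and let $z(t) \in \mathcal{U}_{\gamma_B}(H_B)$. Then $a(t + 1) \in B$ by
Lemma 6. We proceed by induction. Assume $a(t + 1), \ldots, a(t + n) \in B$ for some $n \in \mathbb{N}^*$.
Since $z(t) \in \mathcal{U}_{\gamma_B}(H_B)$,

\[ \sum_{a \in A \setminus B} z_a(t + n) \le \sum_{a \in A \setminus B} z_a(t) < \gamma_B. \]

Together with $\mathcal{E}_t$, this implies that $z(t + n) \in \mathcal{U}_{\gamma_B}(H_B)$. Consequently, by Lemma 6,
$a(t + n + 1) \in B$. ■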

The proof of Claim 2 builds on the proof of Theorem 2.1 of Hart and Mas-Colell
[27]. It is different though, since we need to find the convergence rate of the maximal
regret conditional on a given initial history (in particular, on those where the past average
play is close to a curb set), which Hart and Mas-Colell [27] do not provide. So our result
cannot be directly derived from their proof.

For the proof of Claim 2 we need the following lemmata.

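Lemma 7. Let $x_1, x_2, \ldots$ be a sequence of real random variables with
$E[x_n \mid x_{n-1}, \ldots, x_1] = 0$ and $\mathrm{Var}[x_n] \le \sigma^2$ for all n. Then for every $\pi > 0$ and every
$m = 1, 2, \ldots$

\[ \Pr\Bigl[ \max_{n > m} \Bigl| \frac{1}{n} \sum_{k = m+1}^{n} x_k \Bigr| \ge \frac{\sigma}{\sqrt{m\pi}} \Bigr] \le \pi. \]

Proof. Hajek-Renyi inequality (e.g., Bullen [12]) implies

\[ \Pr\Bigl[ \max_{m < k \le n} c_k\, |x_{m+1} + \ldots + x_k| \ge \varepsilon \Bigr] \le \frac{1}{\varepsilon^2} \sum_{k = m+1}^{n} c_k^2\, \mathrm{Var}[x_k]. \]

Using $c_k = 1/k$ and $\mathrm{Var}[x_k] \le \sigma^2$, the right-hand side can be bounded as follows,

\[ \frac{1}{\varepsilon^2} \sum_{k = m+1}^{n} c_k^2\, \mathrm{Var}[x_k] \le \frac{\sigma^2}{\varepsilon^2} \sum_{k = m+1}^{n} \frac{1}{k^2} < \frac{\sigma^2}{m \varepsilon^2}. \]

Taking the limit $n \to \infty$ yields

\[ \Pr\Bigl[ \max_{k > m} \frac{1}{k}\, |x_{m+1} + \ldots + x_k| \ge \varepsilon \Bigr] \le \frac{\sigma^2}{m \varepsilon^2}. \]

The result is immediate by substitution $\pi = \sigma^2 / (m \varepsilon^2)$. ■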
Define $\xi(1) = P_i(R_i(1))$ and for all $t = 2, 3, \ldots$

\[ \xi(t) = t\,P_i(R_i(t)) - (t - 1)\,P_i(R_i(t - 1)). \qquad (11) \]

Lemma 8. $\xi(t)$ is uniformly bounded and $E[\xi(t) \mid h(t - 1)] \le C/t$ holds for some constant
C uniformly for all t.

Proof. Let $x_0 = R_i(t - 1)$ and $x = R_i(t)$. Note that $R_i(t) = \frac{t-1}{t} R_i(t - 1) + \frac{1}{t} r_i$, where
$r_i = [u_i(k, a_{-i}(t)) - u_i(a(t))]_{k \in A_i}$. Hence

\[ x - x_0 = \frac{1}{t}(r_i - x_0). \qquad (12) \]

The regret for an action is bounded by 2U and the difference between two regret terms
by 4U. Thus, in sup norm, $\|r_i - x_0\| \le 4U$ and $\|x - x_0\| \le 4U/t$.

Since $P_i$ is $C^2$, there exist constants c, c' and c'' such that if $\|y\| \le 4U$, then for all v,

\[ |P_i(y)| \le c, \quad |\nabla P_i(y) \cdot v| \le c' \|v\|, \quad \text{and} \quad |v \cdot \nabla^2 P_i(y)\, v| \le c'' \|v\|^2. \]

Moreover, $\xi(t) = P_i(x_0) + t\,(P_i(x) - P_i(x_0))$, hence $|\xi(t)| \le c + t c' \|x - x_0\| \le c + 4Uc'$.
Thus $\xi(t)$ is uniformly bounded.

We now show that $E[\xi(t) \mid h(t - 1)] \le C/t$ for $C = 8U^2 c''$. By definition of c'' and
Taylor-Lagrange theorem,

\[ P_i(x) \le P_i(x_0) + \nabla P_i(x_0) \cdot (x - x_0) + \tfrac{1}{2} c'' \|x - x_0\|^2. \]

Using (12) we get:

\[ P_i(x) \le \frac{t - 1}{t} P_i(x_0) + \frac{1}{t}\bigl(P_i(x_0) - \nabla P_i(x_0) \cdot x_0\bigr) + \frac{1}{t} \nabla P_i(x_0) \cdot r_i + \frac{C}{t^2}. \]

Since $P_i$ is convex and $P_i(0) = 0$, we have:

\[ P_i(x_0) - \nabla P_i(x_0) \cdot x_0 = P_i(x_0) + \nabla P_i(x_0) \cdot (0 - x_0) \le P_i(0) = 0. \]

Therefore

\[ P_i(x) \le \frac{t - 1}{t} P_i(x_0) + \frac{1}{t} \nabla P_i(x_0) \cdot r_i + \frac{C}{t^2}, \]

so that

\[ \xi(t) = t\,P_i(x) - (t - 1)\,P_i(x_0) \le \nabla P_i(x_0) \cdot r_i + \frac{C}{t}. \]

To prove that $E[\xi(t) \mid h(t - 1)] \le C/t$, it suffices to show that $E[\nabla P_i(x_0) \cdot r_i \mid h(t - 1)] = 0$.
To see this, note that $E[r_{i,k} \mid h(t - 1)] = u_i(k, a_{-i}(t)) - u_i(q_i(t), a_{-i}(t))$, hence

\[ E[q_i(t) \cdot r_i \mid h(t - 1)] = q_i(t) \cdot E[r_i \mid h(t - 1)] = \sum_{k \in A_i} q_{i,k} \bigl[u_i(k, a_{-i}) - u_i(q_i, a_{-i})\bigr] = 0. \]

Since $q_i(t)$ is proportional to $\nabla P_i(x_0)$, the result follows. ■
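Proof of Claim 2. By (11) we have

\[ P_i(R_i(t + n)) = \frac{1}{t + n} \sum_{s = t+1}^{t+n} \xi(s) + \frac{t}{t + n} P_i(R_i(t))
 = \frac{1}{t + n} \sum_{s = t+1}^{t+n} \zeta(s) + \frac{1}{t + n} \sum_{s = t+1}^{t+n} E[\xi(s) \mid h(s - 1)] + \frac{t}{t + n} P_i(R_i(t)), \]

where $\zeta(s) = \xi(s) - E[\xi(s) \mid h(s - 1)]$. As we have assumed $z(t) \in \mathcal{U}_\gamma(H_B)$, we have

\[ \frac{t}{t + n} P_i(R_i(t)) < \frac{t}{t + n} \gamma < \gamma. \]

Next, by Lemma 8,

\[ \frac{1}{t + n} \sum_{s = t+1}^{t+n} E[\xi(s) \mid h(s - 1)] \le \frac{1}{t + n} \sum_{s = t+1}^{t+n} \frac{C}{s} \le C\, \frac{\ln(t + n) - \ln t}{t + n}. \]

Maximizing among all $n > 0$ yields $\frac{\ln(t + n) - \ln t}{t + n} \le (te)^{-1}$, hence

\[ \frac{1}{t + n} \sum_{s = t+1}^{t+n} E[\xi(s) \mid h(s - 1)] \le \frac{C}{te}. \]

Let $\sigma^2$ be a bound on $\mathrm{Var}[\zeta(s)]$ for all s (this bound exists, since by Lemma 8 variables
$\xi(s)$ are uniformly bounded). Applying Lemma 7 we obtain that for every $\pi > 0$ and
every t,

\[ \Pr\Bigl[ \max_{n \ge 1} \Bigl| \frac{1}{t + n} \sum_{s = t+1}^{t+n} \zeta(s) \Bigr| \ge \frac{\sigma}{\sqrt{t\pi/2}} \Bigr] \le \frac{\pi}{2}. \]

Hence with probability at least $1 - \pi/2$ the following holds for all $n \in \mathbb{N}^*$,

\[ P_i(R_i(t + n)) < \frac{\sigma}{\sqrt{t\pi/2}} + \frac{C}{te} + \gamma, \]

and it holds for both $i = 1, 2$ simultaneously with probability at least $(1 - \pi/2)^2 > 1 - \pi$.
Choosing $t_0 = t_0(\gamma, \pi)$ so large that $\sqrt{\frac{2\sigma^2}{t_0 \pi}} + \frac{C}{t_0 e} < \gamma_B - \gamma$, we obtain that event $\mathcal{E}_t$:

\[ P_i(R_i(t + n)) < \gamma_B \ \text{for each } i = 1, 2 \text{ and all } n \in \mathbb{N}^* \]

occurs with probability at least $(1 - \pi/2)^2 > 1 - \pi$. ■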



References 

[1] J.-P. Aubin and A. Cellina. Differential Inclusions. Springer, 1984.

[2] D. Balkenborg, J. Hofbauer, and C. Kuzmics. Refined best reply correspondence and 
dynamics. Theoretical Econ., forthcoming. 

[3] K. Basu and J. W. Weibull. Strategy subsets closed under rational behavior. Econ.
Letters 36 (1991), 141-146.

[4] M. Benaïm and M. Faure. Consistency of vanishing smooth fictitious play. Working
paper, available at http://arxiv.org/abs/1105.1690, 2011.

[5] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential
inclusions. SIAM J. Control and Optimization 44 (2005), 328-348.

[6] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential
inclusions. Part II: Applications. Math. Operations Res. 31 (2006), 673-695.

[7] U. Berger. Fictitious play in 2 x n games. J. Econ. Theory 120 (2005), 139-154. 

[8] U. Berger. Two more classes of games with the continuous-time fictitious play prop- 
erty. Games Econ. Behav. 60 (2007), 247-261. 

[9] D. Blackwell. An analog of the minmax theorem for vector payoffs. Pacific J. Math. 
6 (1956), 1-8. 

[10] A. Blum, E. Even-Dar, and K. Ligett. Routing without regret: on convergence to 
Nash equilibria of regret-minimizing algorithms in routing games. In Proceed. 25th 
Annual ACM Symposium on Principles of Distributed Computing, pp. 45-52, 2006. 






[11] G. Brown. Iterative solutions of games by fictitious play. In T. Koopmans (Ed.),
Activity Analysis of Production and Allocation, Vol. 13 of Cowles Commission
Monograph, pp. 374-376. New York: Wiley, 1951.

[12] P. Bullen. A Dictionary of Inequalities. Addison Wesley Longman, Harlow, 1998.

[13] N. Cesa-Bianchi and G. Lugosi. Potential-based algorithms in on-line prediction and 
game theory. Machine Learning 51 (2003), 239-261. 

[14] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge Univ. 
Press, 2006. 

[15] Y. Chen and J. W. Vaughan. A new understanding of prediction markets via no-regret 
learning. In Proceed. 11th ACM Conference on Electronic Commerce, pp. 189-198, 
2010. 

[16] R. T. Clemen and R. L. Winkler. Aggregating probability distributions. In
W. Edwards, R. Miles, and D. von Winterfeldt (Eds.), Advances Dec. Anal., pp. 154-176.
Cambridge Univ. Press, 2007.

[17] P. DeMarzo, I. Kremer, and Y. Mansour. Online trading algorithms and robust 
option pricing. In Proceed. 38th Annual ACM Symposium on Theory of Computing, 
pp. 477-486, 2006. 

[18] M. Faure and G. Roth. Stochastic approximations of set-valued dynamical systems: 
convergence with positive probability to an attractor. Math. Operations Res. 35 
(2010), 624-640. 

[19] D. Foster and R. Vohra. A randomization rule for selecting forecasts. Operations 
Res. 41 (1993), 704-709. 

[20] D. Foster and R. Vohra. Regret in the online decision problem. Games Econ. Behav. 
29 (1999), 7-35. 

[21] Y. Freund and R. Schapire. Adaptive game playing using multiplicative weights. 
Games Econ. Behav. 29 (1999), 79-103. 

[22] D. Fudenberg and D. Levine. Consistency and cautious fictitious play. J. Econ. 
Dynam. Control 19 (1995), 1065-1089. 

[23] A. Gaunersdorfer and J. Hofbauer. Fictitious play, Shapley polygons, and the
replicator equation. Games Econ. Behav. 11 (1995), 279-303.

[24] I. Gilboa and A. Matsui. Social stability and equilibrium. Econometrica 59 (1991),
859-867.

[25] J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W.
Tucker, and P. Wolfe (Eds.), Contributions to the Theory of Games, Vol. III, Ann.
Math. Stud. 39, pp. 97-139. Princeton Univ. Press, 1957.

[26] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated
equilibrium. Econometrica 68 (2000), 1127-1150.






[27] S. Hart and A. Mas-Colell. A general class of adaptive procedures. J. Econ. Theory 
98 (2001), 26-54. 

[28] S. Hart and A. Mas-Colell. Continuous-time regret-based dynamics. Games Econ. 
Behav. 45 (2003), 375-394. 

[29] J. Hofbauer. Stability for the best response dynamics. Mimeo, 1995. 

[30] J. Hofbauer and W. H. Sandholm. Stable games and their dynamics. J. Econ. Theory 
144 (2009), 1665-1693. 

[31] J. Hofbauer and K. Sigmund. Evolutionary Games and Population Dynamics.
Cambridge Univ. Press, 1998.

[32] J. Hofbauer and S. Sorin. Best response dynamics for continuous zero-sum games. 
Discrete and Continuous Dynamical Systems, Series B, 6 (2006), 215-224.

[33] J. Hofbauer, S. Sorin, and Y. Viossat. Time average replicator and best reply
dynamics. Math. Operations Res. 34 (2009), 263-269.

[34] J. Hofbauer and J. W. Weibull. Evolutionary selection against dominated strategies.
J. Econ. Theory 71 (1996), 558-573. 

[35] S. Irani and A. Karlin. On-line computation. In D. Hochbaum (Ed.), Approximation 
Algorithms for NP-Hard Problems, pp. 521-564. Boston: PWS-Kent, 1996. 

[36] V. Krishna and T. Sjöström. On the convergence of fictitious play. Math. Operations
Res. 23 (1998), 479-511. 

[37] R. P. Larrick and J. B. Soll. Intuitions about combining opinions: misappreciation
of the averaging principle. Manage. Sci. 52 (2006), 111-127. 

[38] E. Lehrer. A wide range no-regret theorem. Games Econ. Behav. 42 (2003), 101-115. 

[39] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and 
Computation 108 (1994), 212-261. 

[40] Y. Mansour. Regret minimization and job scheduling. In Proceed. 36th Conference 
on Current Trends in Theory and Practice of Computer Science, pp. 71-76. Springer, 
2010. 

[41] A. Matsui. Best response dynamics and socially stable strategies. J. Econ. Theory 
57 (1992), 343-362. 

[42] D. Monderer, D. Samet, and A. Sela. Belief affirming in learning processes. J. Econ. 
Theory 73 (1997), 438-452. 

[43] H. Moulin and J. P. Vial. Strategically zero-sum games: the class of games whose 
completely mixed equilibria cannot be improved upon. Int. J. Game Theory 7 (1978), 
201-221. 

[44] R. Selten. An axiomatic theory of a risk dominance measure for bipolar games with 
linear incentives. Games Econ. Behav. 8 (1995), 213-263. 






[45] L. S. Shapley. Some topics in two-person games. In M. Dresher, L. S. Shapley, and
A. W. Tucker (Eds.), Advances in Game Theory, pp. 1-28. Princeton Univ. Press,
1964.

[46] C. Sparrow, S. van Strien, and C. Harris. Fictitious play in 3 × 3 games: The
transition between periodic and chaotic behaviour. Games Econ. Behav. 63 (2008), 
259-291. 

[47] A. Timmermann. Forecast combinations. In G. Elliott, C. W. Granger, and
A. Timmermann (Eds.), Handbook of Economic Forecasting. Elsevier, 2006.

[48] J. Tirole. The Theory of Industrial Organization. MIT Press, 1988. 

[49] J. W. Weibull. Evolutionary Game Theory. Cambridge Univ. Press, 1995.

[50] E. Yanovskaya. Equilibrium points in polymatrix games (in Russian). Litovskii 
Matematicheskii Sbornik 8 (1968), 381-384. 

[51] H. P. Young. The evolution of conventions. Econometrica 61 (1993), 57-84. 

[52] H. P. Young. Strategic Learning and Its Limits. Oxford Univ. Press, 2004. 






